(54) Title: SEQUENCING BY LIGATION OF ENCODED ADAPTORS
(57) Abstract 

The invention provides a method of nucleic acid sequence analysis based on the ligation of one or more sets of encoded adaptors to the
terminus of a target polynucleotide. Encoded adaptors whose protruding strands form perfectly matched duplexes with the complementary
protruding strands of the target polynucleotide are ligated, and the identity of the nucleotides in the protruding strands is determined by an
oligonucleotide tag carried by the encoded adaptor. Such determination, or "decoding," is carried out by specifically hybridizing a labeled
tag complement to its corresponding tag on the ligated adaptor.
Field of the Invention

The invention relates generally to methods for determining the nucleotide sequence of 
a polynucleotide, and more particularly, to a method of identifying terminal nucleotides of a 
polynucleotide by specific ligation of encoded adaptors. 



BACKGROUND 

The DNA sequencing methods of choice for nearly all scientific and commercial
applications are based on the dideoxy chain termination approach pioneered by Sanger, e.g.
Sanger et al, Proc. Natl. Acad. Sci., 74: 5463-5467 (1977). The method has been improved
in several ways and, in a variety of forms, is used in all commercial DNA sequencing
instruments, e.g. Hunkapiller et al, Science, 254: 59-67 (1991).

The chain termination method requires the generation of one or more sets of labeled
DNA fragments, each having a common origin and each terminating with a known base. The
set or sets of fragments must then be separated by size to obtain sequence information. The
size separation is usually accomplished by high resolution gel electrophoresis, which must
have the capacity of distinguishing very large fragments differing in size by no more than a
single nucleotide. Despite many significant improvements, such as separations with capillary
arrays and the use of non-gel electrophoretic separation mediums, the technique does not
readily lend itself to miniaturization or to massively parallel implementation.

As an alternative to the Sanger-based approaches to DNA sequencing, several so-called
"base-by-base" or "single base" sequencing approaches have been explored, e.g.
Cheeseman, U.S. patent 5,302,509; Tsien et al, International application WO 91/06678;
Rosenthal et al, International application WO 93/21340; Canard et al, Gene, 148: 1-6 (1994);
and Metzker et al, Nucleic Acids Research, 22: 4259-4267 (1994). These approaches are
characterized by the determination of a single nucleotide per cycle of chemical or biochemical
operations and no requirement of a separation step. Thus, if they could be implemented as
conceived, "base-by-base" approaches promise the possibility of carrying out many thousands
of sequencing reactions in parallel, for example, on target polynucleotides attached to
microparticles or on solid phase arrays, e.g. International patent application
PCT/US95/12678 (WO 96/12039).

Unfortunately, "base-by-base" sequencing schemes have not had widespread
application because of numerous problems, such as inefficient chemistries which prevent
determination of any more than a few nucleotides in a complete sequencing operation.
Moreover, in base-by-base approaches that require enzymatic manipulations, further
problems arise with instrumentation used for automated processing. When a series of
enzymatic steps are carried out in reaction chambers having high surface-to-volume ratios and
narrow channel dimensions, enzymes may stick to surface components, making washes and
successive processing steps very difficult. The accumulation of protein also affects molecular
reporter systems, particularly those employing fluorescent labels, and renders the
interpretation of measurements based on such systems difficult and inconvenient. These and
similar difficulties have significantly slowed the application of "base-by-base" sequencing
schemes to parallel sequencing efforts.

An important advance in base-by-base sequencing technology could be made,
especially in automated systems, if an alternative approach were available for determining the
terminal nucleotides of polynucleotides that minimized or eliminated repetitive processing
cycles employing multiple enzymes.

Summary of the Invention

Accordingly, an object of our invention is to provide a DNA sequencing scheme
which does not suffer the drawbacks of current base-by-base approaches.

Another object of our invention is to provide a method of DNA sequencing which is
amenable to parallel, or simultaneous, application to thousands of DNA fragments present in
a common reaction vessel.

A further object of our invention is to provide a method of DNA sequencing which 
permits the identification of a terminal portion of a target polynucleotide with minimal 
enzymatic steps. 

Yet another object of our invention is to provide a set of encoded adaptors for
identifying the sequence of a plurality of terminal nucleotides of one or more target
polynucleotides.

Our invention provides these and other objects by providing a method of nucleic acid
sequence analysis based on the ligation of one or more sets of encoded adaptors to a terminus
of a target polynucleotide (or to the termini of multiple target polynucleotides when used in a
parallel sequencing operation). Each encoded adaptor comprises a protruding strand and an
oligonucleotide tag selected from a minimally cross-hybridizing set of oligonucleotides.
Encoded adaptors whose protruding strands form perfectly matched duplexes with the
complementary protruding strands of the target polynucleotide are ligated. After ligation, the
identity and ordering of the nucleotides in the protruding strands are determined, or
"decoded," by specifically hybridizing a labeled tag complement to its corresponding tag on
the ligated adaptor.

For example, if an encoded adaptor with a protruding strand of four nucleotides, say
5'-AGGT, forms a perfectly matched duplex with the complementary protruding strand of a
target polynucleotide and is ligated, the four complementary nucleotides, 3'-TCCA, on the
polynucleotide may be identified by a unique oligonucleotide tag selected from a set of 256
such tags, one for every possible four nucleotide sequence of the protruding strands. Tag
complements are applied to the ligated adaptors under conditions which allow specific
hybridization of only those tag complements that form perfectly matched duplexes (or
triplexes) with the oligonucleotide tags of the ligated adaptors. The tag complements may be
applied individually or as one or more mixtures to determine the identity of the oligonucleotide
tags, and therefore, the sequences of the protruding strands.
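
The one-tag-per-protruding-strand idea in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the integer tag indices, the dictionary names, and the `decode` helper are hypothetical and do not come from the specification, which identifies tags by hybridization rather than by lookup.

```python
from itertools import product

BASES = "ACGT"
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(strand: str) -> str:
    """Base-by-base complement of a strand.

    A strand written 5'->3' yields its partner written 3'->5',
    matching the 5'-AGGT / 3'-TCCA notation of the text.
    """
    return "".join(PAIR[b] for b in strand)

# Hypothetical assignment: one integer tag index for each of the
# 4**4 = 256 possible four-nucleotide protruding strands.
TAG_INDEX = {"".join(p): i for i, p in enumerate(product(BASES, repeat=4))}
ADAPTOR_FOR_TAG = {i: strand for strand, i in TAG_INDEX.items()}

def decode(tag_index: int) -> str:
    """Return the target's protruding strand (3'->5') implied by a detected tag."""
    return complement(ADAPTOR_FOR_TAG[tag_index])

# Detecting the tag carried by the 5'-AGGT adaptor identifies 3'-TCCA on the target.
print(len(TAG_INDEX), decode(TAG_INDEX["AGGT"]))
```

The lookup stands in for the hybridization readout: whichever labeled tag complement binds tells the observer which adaptor, and therefore which protruding strand, was ligated.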

As explained more fully below, the encoded adaptors may be used in sequence
analysis either (i) to identify one or more nucleotides as a step of a process that involves
repeated cycles of ligation, identification, and cleavage, as described in Brenner U.S. patent
5,599,675 and PCT Publication No. WO 95/27080, or (ii) as a "stand alone" identification
method, wherein sets of encoded adaptors are applied to target polynucleotides such that each
set is capable of identifying the nucleotide sequence of a different portion of a target
polynucleotide; that is, in the latter embodiment, sequence analysis is carried out with a single
ligation for each set followed by identification.

An important feature of the encoded adaptors is the use of oligonucleotide tags that
are members of a minimally cross-hybridizing set of oligonucleotides, e.g. as described in
International patent applications PCT/US95/12791 (WO 96/12041) and PCT/US96/09513
(WO 96/41011). The sequences of oligonucleotides of such a set differ from the sequences of
every other member of the same set by at least two nucleotides. Thus, each member of such a
set cannot form a duplex (or triplex) with the complement of any other member with less than
two mismatches. Preferably, each member of a minimally cross-hybridizing set differs from
every other member by as many nucleotides as possible consistent with the size of set required
for a particular application. For example, where longer oligonucleotide tags are used, such as
12- to 20-mers for delivering labels to encoded adaptors, then the difference between members
of a minimally cross-hybridizing set is preferably significantly greater than two. Preferably,
each member of such a set differs from every other member by at least four nucleotides.
More preferably, each member of such a set differs from every other member by at least six
nucleotides. Complements of oligonucleotide tags of the invention are referred to herein as
"tag complements."

Oligonucleotide tags may be single stranded and be designed for specific 
hybridization to single stranded tag complements by duplex formation. Oligonucleotide tags 
may also be double stranded and be designed for specific hybridization to single stranded tag 
complements by triplex formation. Preferably, the oligonucleotide tags of the encoded 
adaptors are double stranded and their tag complements are single stranded, such that specific 
hybridization of a tag with its complements occurs through the formation of a triplex 
structure. 
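
The defining property of a minimally cross-hybridizing set described above (every member differs from every other member by at least a given number of nucleotides) can be checked, and a small set built, with a simple sketch. The greedy construction below is an assumption made for illustration; the cited applications describe their own construction methods, and real tag design would also weigh hybridization thermodynamics, which this sketch ignores. The function names are hypothetical.

```python
from itertools import product

def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def is_minimally_cross_hybridizing(tags, min_diff: int) -> bool:
    """True if every pair of tags differs in at least min_diff positions."""
    return all(
        mismatches(a, b) >= min_diff
        for i, a in enumerate(tags)
        for b in tags[i + 1:]
    )

def greedy_set(length: int, min_diff: int, alphabet: str = "ACGT"):
    """Greedily collect sequences that stay min_diff away from all chosen ones.

    Candidates are visited in lexicographic order; the result is valid
    but not guaranteed to be the largest possible set.
    """
    chosen = []
    for cand in product(alphabet, repeat=length):
        seq = "".join(cand)
        if all(mismatches(seq, t) >= min_diff for t in chosen):
            chosen.append(seq)
    return chosen

tags = greedy_set(length=4, min_diff=2)
print(len(tags), is_minimally_cross_hybridizing(tags, 2))
```

Raising `min_diff` toward four or six, as the text prefers for longer tags, shrinks the set obtainable at a given length, which is the trade-off between set size and discrimination noted above.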
Preferably, the method of the invention comprises the following steps: (a) ligating an
encoded adaptor to an end of a polynucleotide, the adaptor having an oligonucleotide tag
selected from a minimally cross-hybridizing set of oligonucleotides and a protruding strand
complementary to a protruding strand of the polynucleotide; and (b) identifying one or more
nucleotides in the protruding strand of the polynucleotide by specifically hybridizing a tag
complement to the oligonucleotide tag of the encoded adaptor.

Brief Description of the Drawings 
Figures 1A-1E diagrammatically illustrate the use of encoded adaptors to determine
the terminal nucleotide sequences of a plurality of tagged polynucleotides.

Figure 2 illustrates the phenomenon of self-ligation of identical polynucleotides that are
anchored to a solid phase support.

Figure 3A illustrates steps in a preferred method of the invention in which a double
stranded adaptor having a blocked 3' carbon is ligated to a target polynucleotide.

Figure 3B illustrates the use of the preferred embodiment in a method of DNA
sequencing by stepwise cycles of ligation and cleavage.

Figure 4 illustrates data from the determination of the terminal nucleotides of a test
polynucleotide using the method of the present invention.

Figure 5 is a schematic representation of a flow chamber and detection apparatus for
observing a planar array of microparticles loaded with cDNA molecules for sequencing.






Definitions 

As used herein, the term "encoded adaptor" is used synonymously with the term
"encoded probe" of priority document U.S. patent application Ser. No. 08/689,587.

As used herein, the term "ligation" means the formation of a covalent bond between
the ends of one or more (usually two) oligonucleotides. The term usually refers to the
formation of a phosphodiester bond resulting from the following reaction, which is usually
catalyzed by a ligase:

oligo1(5')-OP(O-)(=O)O + HO-(3')oligo2-5'  ->  oligo1(5')-OP(O-)(=O)O-(3')oligo2-5'

where oligo1 and oligo2 are either two different oligonucleotides or different ends of the same
oligonucleotide. The term encompasses non-enzymatic formation of phosphodiester bonds, as
well as the formation of non-phosphodiester covalent bonds between the ends of
oligonucleotides, such as phosphorothioate bonds, disulfide bonds, and the like. A ligation
reaction is usually template driven, in that the ends of oligo1 and oligo2 are brought into
juxtaposition by specific hybridization to a template strand. A special case of template-driven


WO 97/46704 PCT/US97/09472 

ligation is the ligation of two double stranded oligonucleotides having complementary 
protruding strands. 

"Complement" or "tag complement" as used herein in reference to oligonucleotide 
tags refers to an oligonucleotide to which an oligonucleotide tag specifically hybridizes to form
a perfectly matched duplex or triplex. In embodiments where specific hybridization results in
a triplex, the oligonucleotide tag may be selected to be either double stranded or single 
stranded. Thus, where triplexes are formed, the term "complement" is meant to encompass 
either a double stranded complement of a single stranded oligonucleotide tag or a single 
stranded complement of a double stranded oligonucleotide tag. 

The term "oligonucleotide" as used herein includes linear oligomers of natural or
modified monomers or linkages, including deoxyribonucleosides, ribonucleosides, anomeric
forms thereof, peptide nucleic acids (PNAs), and the like, capable of specifically binding to a
target polynucleotide by way of a regular pattern of monomer-to-monomer interactions, such
as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types
of base pairing, or the like. Usually monomers are linked by phosphodiester bonds or analogs
thereof to form oligonucleotides ranging in size from a few monomeric units, e.g. 3-4, to
several tens of monomeric units, e.g. 40-60. Whenever an oligonucleotide is represented by a
sequence of letters, such as "ATGCCTG," it will be understood that the nucleotides are in
5'->3' order from left to right and that "A" denotes deoxyadenosine, "C" denotes
deoxycytidine, "G" denotes deoxyguanosine, and "T" denotes thymidine, unless otherwise
noted. Usually oligonucleotides of the invention comprise the four natural nucleotides;
however, they may also comprise non-natural nucleotide analogs. It is clear to those skilled in
the art when oligonucleotides having natural or non-natural nucleotides may be employed, e.g.
where processing by enzymes is called for, usually oligonucleotides consisting of natural
nucleotides are required.

"Perfectly matched" in reference to a duplex means that the poly- or oligonucleotide 
strands making up the duplex form a double stranded structure with one another such that every
nucleotide in each strand undergoes Watson-Crick basepairing with a nucleotide in the other
strand. The term also comprehends the pairing of nucleoside analogs, such as deoxyinosine,
nucleosides with 2-aminopurine bases, and the like, that may be employed. In reference to a 
triplex, the term means that the triplex consists of a perfectly matched duplex and a third 
strand in which every nucleotide undergoes Hoogsteen or reverse Hoogsteen association with 
a basepair of the perfectly matched duplex. Conversely, a "mismatch" in a duplex between a 
tag and an oligonucleotide means that a pair or triplet of nucleotides in the duplex or triplex 
fails to undergo Watson-Crick and/or Hoogsteen and/or reverse Hoogsteen bonding. 

As used herein, "nucleoside" includes the natural nucleosides, including 2'-deoxy and
2'-hydroxyl forms, e.g. as described in Kornberg and Baker, DNA Replication, 2nd Ed.
(Freeman, San Francisco, 1992). "Analogs" in reference to nucleosides includes synthetic
nucleosides having modified base moieties and/or modified sugar moieties, e.g. described by
Scheit, Nucleotide Analogs (John Wiley, New York, 1980); Uhlmann and Peyman, Chemical
Reviews, 90: 543-584 (1990), or the like, with the only proviso that they are capable of
specific hybridization. Such analogs include synthetic nucleosides designed to enhance
binding properties, reduce complexity, increase specificity, and the like.

As used herein "sequence determination" or "determining a nucleotide sequence" in 
reference to polynucleotides includes determination of partial as well as full sequence 
information of the polynucleotide. That is, the term includes sequence comparisons, 
fingerprinting, and like levels of information about a target polynucleotide, as well as the 
express identification and ordering of nucleosides, usually each nucleoside, in a target 
polynucleotide. The term also includes the determination of the identification, ordering, and 
locations of one, two, or three of the four types of nucleotides within a target polynucleotide. 
For example, in some embodiments sequence determination may be effected by identifying the
ordering and locations of a single type of nucleotide, e.g. cytosines, within the target
polynucleotide "CATCGC ..." so that its sequence is represented as a binary code, e.g.
"100101 ..." for "C-(not C)-(not C)-C-(not C)-C ..." and the like.

As used herein, the term "complexity" in reference to a population of polynucleotides 
means the number of different species of molecule present in the population. 
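
The binary representation mentioned in the definition of "sequence determination" above (recording only where one kind of nucleotide occurs) can be illustrated with a short sketch. The function name `binary_signature` is hypothetical and chosen only for this example.

```python
def binary_signature(sequence: str, base: str) -> str:
    """Encode a sequence as 1 where `base` occurs and 0 elsewhere."""
    base = base.upper()
    return "".join("1" if nt == base else "0" for nt in sequence.upper())

# The example from the text: cytosines within "CATCGC" give "100101".
print(binary_signature("CATCGC", "C"))
```

Such a signature carries only partial sequence information, which is why the definition treats it as one level of sequence determination rather than a full read.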

DETAILED DESCRIPTION OF THE INVENTION 
The invention involves the ligation of encoded adaptors specifically hybridized to the 
terminus or termini of one or more target polynucleotides. Sequence information about the 
region where specific hybridization occurs is obtained by "decoding" the oligonucleotide tags 
of the encoded adaptors thus ligated. In one aspect of the invention, multiple sets of encoded 
adaptors are ligated to a target polynucleotide at staggered cleavage points so that the encoded 
adaptors provide sequence information from each of a plurality of portions of the target 
polynucleotide. Such portions may be disjoint, overlapping or contiguous; however,
preferably, the portions are contiguous and together permit the identification of a sequence of
nucleotides equal to the sum of the lengths of the individual portions. In this aspect, there is
only a single ligation of encoded adaptors followed by identification by "decoding" the tags
of the ligated adaptors. In another aspect of the invention, encoded adaptors are employed as
an identification step in a process involving repeated cycles of ligation, identification, and
cleavage, described more fully below.

In the latter embodiment, the invention makes use of nucleases whose recognition
sites are separate from their cleavage sites. Preferably, such nucleases are type IIs restriction
endonucleases. The nucleases are used to generate protruding strands on target
polynucleotides to which encoded adaptors are ligated. The amount of sequence information
obtained in a given embodiment of the invention depends in part on how many such nucleases
are employed and the length of the protruding strand produced upon cleavage.

An important aspect of the invention is the capability of sequencing many target
polynucleotides in parallel. In this aspect, the method of the invention comprises the
following steps: (a) attaching a first oligonucleotide tag from a repertoire of tags to each
polynucleotide in a population of polynucleotides such that each first oligonucleotide tag from
the repertoire is selected from a first minimally cross-hybridizing set; (b) sampling the
population of polynucleotides such that substantially all different polynucleotides in the
population have different first oligonucleotide tags attached; (c) ligating one or more encoded
adaptors to an end of each of the polynucleotides in the population, each encoded adaptor
having a second oligonucleotide tag selected from a second minimally cross-hybridizing set
and a protruding strand complementary to a protruding strand of a polynucleotide of the
population; (d) sorting the polynucleotides of the population by specifically hybridizing the
first oligonucleotide tags with their respective complements, the respective complements being
attached as uniform populations of substantially identical oligonucleotides in spatially discrete
regions on one or more solid phase supports; and (e) identifying one or more nucleotides in
said protruding strands of the polynucleotides by specifically hybridizing a tag complement to
each second oligonucleotide tag of the one or more encoded adaptors. In this embodiment, the
one or more encoded adaptors may be ligated to an end of the polynucleotides either before or
after the polynucleotides have been sorted onto solid phase supports by the first
oligonucleotide tags. In the preferred embodiment, encoded adaptors include a type IIs
restriction endonuclease site which permits the encoded adaptors to be cleaved from the
polynucleotides and the polynucleotides shortened after sequence identification.

In accordance with the preferred embodiment, the method further includes repeated
cycles of ligation, identification, and cleavage, such that one or more nucleotides are identified
in each cycle. Preferably, from 2 to 6 nucleotides are identified and their ordering determined
in each cycle.

Sequence Analysis without Cycles of Ligation and Cleavage 

Sequence analysis without cycles of ligation and cleavage in accordance with the
invention is illustrated by the embodiment of Figures 1A through 1E. In this embodiment, k
target polynucleotides are prepared as described below, and also in Brenner, International
patent applications PCT/US95/12791 (WO 96/12041) and PCT/US96/09513 (WO
96/41011). That is, a sample is taken from a population of polynucleotides conjugated with
oligonucleotide tags designated by small "t's." These tags are sometimes referred to as
oligonucleotide tags for sorting, or as "first" oligonucleotide tags. The tag-polynucleotide
conjugates of the sample are amplified, e.g. by polymerase chain reaction (PCR) or by
cloning, to give 1 through k populations of conjugates, indicated by (14)-(18) in Fig. 1A.
Preferably, the ends of the conjugates opposite of the (small "t") tags are prepared for ligating
one or more adaptors, each of which contains a recognition site for a nuclease whose cleavage
site is separate from its recognition site. In the illustrated embodiment, three such adaptors,
referred to herein as "cleavage adaptors," are employed. The number of such adaptors
employed depends on several factors, including the amount of sequence information desired,
the availability of type IIs nucleases with suitable reaches and cleavage characteristics, and
the like. Preferably, from one to three cleavage adaptors are used, and the adaptors are designed
to accommodate different type IIs nucleases capable of generating protruding strands of at
least four nucleotides after cleavage.

If the method of the invention is being applied to signature sequencing of a cDNA
population, then prior to ligation of the cleavage adaptors, the tag-polynucleotide conjugates
may be cleaved with a restriction endonuclease with a high frequency of recognition sites,
such as Taq I, Alu I, HinP1 I, Dpn II, Nla III, or the like. For enzymes, such as Alu I, that
leave blunt ends, a staggered end may be produced with T4 DNA polymerase, e.g. as
described in Brenner, International patent application PCT/US95/12791 cited above, and
Kuijper et al, Gene, 112: 147-155 (1992). If the target polynucleotides are prepared by
cleavage with Taq I, then the following ends are available for ligation:

     cgannnn ... -3'
       tnnnn ... -5'


Thus, an exemplary set of three cleavage adaptors may be constructed as follows:

(1)  NN ... NGAAGA     cgannnnnnnnnnnnnnnnnnnn ... -3'
     NN ... NCTTCTGCp    tnnnnnnnnnnnnnnnnnnn ... -5'
                          ^^^^

(2)  NN ... NGCAGCA    cgannnnnnnnnnnnnnnnnnnn ... -3'
     NN ... NCGTCGTGCp   tnnnnnnnnnnnnnnnnnnn ... -5'
                              ^^^^

(3)  NN ... NGGGA      cgannnnnnnnnnnnnnnnnnnn ... -3'
     NN ... NCCCTGCp     tnnnnnnnnnnnnnnnnnnn ... -5'
                                  ^^^^

where cleavage adaptors (1), (2), and (3) are shown in capital letters and carry the respective
recognition sites of nucleases Bbs I, Bbv I, and Bsm FI, with a 5' phosphate
indicated as "p." The carets (^) beneath the target polynucleotide mark the
positions of the protruding strands after ligation and cleavage. In all cases, the target
polynucleotide is left with a 5' protruding strand of four nucleotides. Clearly, many different
embodiments can be constructed using different numbers and kinds of nucleases. As
discussed in Brenner, U.S. patent 5,599,675 and WO 95/27080, preferably prior to cleavage
internal Bbs I, Bbv I, and Bsm FI sites are blocked, e.g. by methylation, to prevent
undesirable cleavages at internal sites of the target polynucleotide.
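
How the three cleavage adaptors expose different four-nucleotide portions of the same target can be sketched as follows. The cut offsets used here (counted in nucleotides downstream of the recognition site on the top and bottom strands) are assumptions taken from commonly published values for these enzymes, not from this specification, and the example target sequence and the helper name `exposed_portion` are hypothetical.

```python
# Assumed recognition sites and (top-strand cut, bottom-strand cut) offsets,
# counted in nucleotides downstream of the last base of the site.
ENZYMES = {
    "Bbs I": ("GAAGAC", 2, 6),
    "Bbv I": ("GCAGC", 8, 12),
    "Bsm FI": ("GGGAC", 10, 14),
}

# Top-strand ends of cleavage adaptors (1), (2), and (3) as drawn in the text.
ADAPTOR_END = {"Bbs I": "GAAGA", "Bbv I": "GCAGCA", "Bsm FI": "GGGA"}

def exposed_portion(enzyme: str, target_top: str) -> str:
    """Top-strand nucleotides at the positions left single-stranded on the
    bottom strand after the adaptor is ligated and the enzyme has cut."""
    site, top_cut, bottom_cut = ENZYMES[enzyme]
    ligated = ADAPTOR_END[enzyme] + target_top.upper()
    site_end = ligated.index(site) + len(site)
    return ligated[site_end + top_cut:site_end + bottom_cut]

# Hypothetical Taq I-cleaved target: "cga" followed by 12 nucleotides of interest.
target = "CGA" + "ACGT" + "TGCA" + "AGTC" + "TTTT"
for name in ENZYMES:
    print(name, exposed_portion(name, target))
```

Under these assumptions the three enzymes expose the first, second, and third four-nucleotide portions of the target, which is the contiguous arrangement the text prefers.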

Returning to the illustrated embodiment, cleavage adaptors A1, A2, and A3 are
ligated (20) in a concentration ratio of 1:1:1 to the k target polynucleotides to give the
conjugates shown in Fig. 1B, such that within each population of tag-polynucleotide
conjugates there are approximately equal numbers of conjugates having A1, A2, and A3
attached. After ligation (20), the target polynucleotides are successively cleaved with each of
the nucleases of the cleavage adaptors and ligated to a set of encoded adaptors. First, the
target polynucleotides are cleaved (22) with the nuclease of cleavage adaptor A1, after which a
first set of encoded adaptors is ligated to the resulting protruding strands. The cleavage
results in about a third of the target polynucleotides of each type, i.e. t1, t2, ... tk, being
available for ligation. Preferably, the encoded adaptors are applied as one or more mixtures
of adaptors which taken together contain every possible sequence of a protruding strand.
Reaction conditions are selected so that only encoded adaptors whose protruding strands form
perfectly matched duplexes with those of the target polynucleotide are ligated to form encoded
conjugates (28), (30), and (32) (Fig. 1C). The capital "T's" with subscripts indicate that
unique oligonucleotide tags are carried by the encoded adaptors for labeling. The
oligonucleotide tags carried by encoded adaptors are sometimes referred to as tags for delivering
labels to the encoded adaptors, or as "second" oligonucleotide tags. As described more fully
below, single stranded oligonucleotide tags used for sorting preferably consist of only three of
the four nucleotides, so that a T4 DNA polymerase "stripping" reaction, e.g. Kuijper et al
(cited above), can be used to prepare target polynucleotides for loading onto solid phase
supports. On the other hand, oligonucleotide tags employed for delivering labels may consist
of all four nucleotides.

As mentioned above, encoded adaptors comprise a protruding strand (24) and an
oligonucleotide tag (26). Thus, if the "A1" cleavage of the t1-polynucleotide conjugates
results in the following ends:

5'- ... nnnnnnnnn
3'- ... nnnnnnnnnacct

then oligonucleotide tag T24 could have the following structure (SEQ ID NO: 1):

tggattctagagagagagagagagagag -3'
    aagatctctctctctctctctctc

where the double stranded portion may be one of a set of 48 (=12 nucleotide positions x 4
kinds of nucleotide) double stranded 20-mer oligonucleotide tags that forms a perfectly
matched triplex with a unique tag complement and forms a triplex with at least 6 mismatches
with all other tag complements. The encoded adaptors in this example may be ligated to the
target polynucleotides in one or more mixtures of a total of 768 (3 x 256) members.
Optionally, an encoded adaptor may also comprise a spacer region, as shown in the above
example where the 4 nucleotide sequence "ttct" serves as a spacer between the protruding
strand and the oligonucleotide tag.

After ligation of the first set of encoded adaptors (28), (30), and (32), the
tag-polynucleotide conjugates are cleaved (34) with the nuclease of cleavage adaptor A2, after
which a second set of encoded adaptors is applied to form conjugates (36), (38), and (40)
(Fig. 1D). Finally, the tag-polynucleotide conjugates are cleaved (42) with the nuclease of
cleavage adaptor A3, after which a third set of encoded adaptors is applied to form conjugates
(44), (46), and (48) (Fig. 1E). After completion of the succession of cleavages and ligations
of encoded adaptors, the mixture is loaded (50) onto one or more solid phase supports via
oligonucleotide tags t1 through tk as described more fully below, and as taught by Brenner,
e.g. PCT/US95/12791 or PCT/US96/09513. If a single target polynucleotide is analyzed,
then clearly multiple oligonucleotide tags, t1, t2, ... tk, are not necessary. In such an
embodiment, biotin, or like moiety, can be employed to anchor the polynucleotide-encoded
adaptor conjugate, as no sorting is required. Also, the ordering of the steps of cleavage,
ligation, and loading onto solid phase supports depends on the particular embodiment
implemented. For example, the tag-polynucleotide conjugates may be loaded onto solid phase
support first, followed by ligation of cleavage adaptors, cleavage thereof, and ligation of
encoded adaptors; or, the cleavage adaptors may be ligated first, followed by loading,
cleavage, and ligation of encoded adaptors; and so on.

After encoded adaptors are ligated to the ends of a target polynucleotide in
accordance with the invention, sequence information is obtained by successively applying
labeled tag complements to the immobilized target polynucleotides, either individually or as
mixtures under conditions that permit the formation of perfectly matched duplexes and/or
triplexes between the oligonucleotide tags of the encoded adaptors and their respective tag
complements. The numbers and complexity of the mixtures depend on several factors,
including the type of labeling system used, the length of the portions whose sequences are to
be identified, whether complexity reducing analogs are used, and the like. Preferably, for the
embodiment illustrated in Figures 1A through 1E, a single fluorescent dye is used to label each
of 48 (=3 x 16) tag complements. The tag complements are applied individually to identify
the nucleotides of each of the four-nucleotide portions of the target polynucleotide (i.e., 4 tag
complements for each of 12 positions for a total of 48). Clearly, portions of different lengths
would require different numbers of tag complements, e.g. in accordance with this
embodiment, a 5-nucleotide portion would require 20 tag complements, a 2-nucleotide portion
would require 8 tag complements, and so on. The tag complements are applied under
[...] sequences of the 4-mer portions of the target sequence:

ANNN    NANN    NNAN    NNNA
CNNN    NCNN    NNCN    NNNC
GNNN    NGNN    NNGN    NNNG
TNNN    NTNN    NNTN    NNNT

where "N" is any one of the nucleotides, A, C, G, or T. Thus, for each nucleotide position
four separate interrogations are made, one for each kind of nucleotide. This embodiment
incorporates a significant degree of redundancy (a total of 16 tag complements are used to
identify 4 nucleotides) in exchange for increased reliability of nucleotide determination.
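
The interrogation scheme above, with one tag complement per combination of position and kind of nucleotide, can be sketched as follows. This is a simplified illustration, not the apparatus described in the specification: the signal values are hypothetical fluorescence intensities, and choosing the brightest of the four interrogations at each position is an assumed decision rule.

```python
BASES = "ACGT"

def interrogation_patterns(portion_length: int):
    """One pattern per (kind of nucleotide, position), in the order of the
    table above: ANNN, NANN, NNAN, NNNA, CNNN, ..."""
    return [
        "N" * i + base + "N" * (portion_length - i - 1)
        for base in BASES
        for i in range(portion_length)
    ]

def decode_portion(signals: dict, portion_length: int) -> str:
    """Call each position as the base whose interrogation gave the largest signal.

    `signals` maps a pattern such as "NNGN" to a measured intensity
    (hypothetical values in this sketch).
    """
    called = []
    for i in range(portion_length):
        def intensity(base, i=i):
            return signals["N" * i + base + "N" * (portion_length - i - 1)]
        called.append(max(BASES, key=intensity))
    return "".join(called)

# Hypothetical readout for a portion whose sequence is GATC:
# strong signal for the matching pattern, weak background elsewhere.
truth = "GATC"
signals = {
    p: (0.9 if p.replace("N", "") == truth[p.index(p.replace("N", ""))] else 0.05)
    for p in interrogation_patterns(4)
}
print(len(interrogation_patterns(4)), decode_portion(signals, 4))
```

The pattern count grows as four times the portion length, which reproduces the figures in the text: 16 for a four-nucleotide portion, 20 for five, and 8 for two.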

Alternative*, 12 mixtures of 4 tag complements each cou.d apphed m success^ by 
usmg four spectrally d.stingu.shab.e fluorescent dyes, such that there is a one-to^ne 
, correspondence between dyes and kinds of nucleotide. For example, a m.xturc of 4 tag 

* compLnts may .dentify nucleotide V in the prot^ng strand sequence nnxn so th* a 

20 first fluorescent label is observed if**, a second fluorescent label ,s observed .f x=C, a 
third fluorescent label .s observed if x=G, and so on. 
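
As a rough illustration of the interrogation scheme just described, the following Python sketch counts the tag complements needed for a portion of given length and decodes a four-nucleotide portion from per-position signals. The signal values, the function names, and the rule of calling the nucleotide with the strongest of the four signals are assumptions made for illustration only; the text does not specify how signals are compared.

```python
NUCLEOTIDES = "ACGT"


def tag_complements_needed(portion_length):
    """One tag complement per (position, nucleotide) pair, i.e. 4 per position."""
    return 4 * portion_length


def decode_portion(signals, portion_length=4):
    """Call one nucleotide per position from measured signals.

    `signals` maps (position, nucleotide) -> fluorescence reading.
    Assumed decision rule: the strongest of the four readings wins.
    """
    called = []
    for pos in range(portion_length):
        best = max(NUCLEOTIDES, key=lambda nt: signals[(pos, nt)])
        called.append(best)
    return "".join(called)


# Hypothetical readings for a protruding strand that reads GATC.
signals = {(p, nt): 0.05 for p in range(4) for nt in NUCLEOTIDES}
for p, nt in enumerate("GATC"):
    signals[(p, nt)] = 1.0

print(tag_complements_needed(4))   # 16
print(tag_complements_needed(12))  # 48
print(decode_portion(signals))     # GATC
```

The four readings per position are what give the redundancy mentioned above: a call can be rejected when no single reading clearly dominates.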

Further sequence information can be obtained using the embodiment described above
in a process analogous to the "multi-stepping" process disclosed in Brenner, International
patent application PCT/US95/03678 (WO 95/27080). In this embodiment, a fourth adaptor,
referred to herein as a "stepping adaptor," is ligated to the ends of the target polynucleotides
along with cleavage adaptors A1, A2, and A3, for example, in a concentration ratio of
3:1:1:1. Thus, approximately half of the available ends are ligated to the stepping adaptor.
The stepping adaptor includes a recognition site for a type IIs nuclease positioned such that its
reach (defined below) will permit cleavage of the target polynucleotides at the end of the
sequence determined via cleavage adaptors A1, A2, and A3. An example of a stepping
adaptor that could be used with the above set of cleavage adaptors is as follows:



    NN ... NCTGGAGA    cgannnnnnnnnnnnnnnnnnn ... -3'
    NN ... NGACCTCTGCp   tnnnnnnnnnnnnnnnnnnn ... -5'

where, as above, the recognition site of the nuclease, in this case BpmI, is singly underlined
and the nucleotides at the cleavage site are doubly underlined. The target polynucleotides
cleaved with the nuclease of the stepping adaptor may be ligated to a further set of cleavage
adaptors A4, A5, and A6, which may contain nuclease recognition sites that are the same as or
different from those contained in cleavage adaptors A1, A2, and A3. Whether or not an
enlarged set of encoded adaptors is required depends on whether cleavage and ligation
reactions can be tolerated in the signal measurement apparatus. If, as above, it is desired to
minimize enzyme reactions in connection with signal measurement, then additional sets of
encoded adaptors must be employed. That is, where above 768 oligonucleotide tags and tag
complements were called for, with six cleavage reactions producing protruding strands of four
nucleotides each, 1536 oligonucleotide tags and tag complements (24 mixtures of 64 tag
complements each) would be required. Exemplary cleavage adaptors A4, A5, and A6, with
the same nuclease recognition sites as A1, A2, and A3, and which could be used with the
stepping adaptor shown above are as follows:



    (4)  NN ... NGAAGACNN     nnnnnnnnnnnnnnnnnn ... -3'
         NN ... NCTTCTGp    nnnnnnnnnnnnnnnnnnnn ... -5'

    (5)  NN ... NGCAGCACNN    nnnnnnnnnnnnnnnnnn ... -3'
         NN ... NCGTCGTGp   nnnnnnnnnnnnnnnnnnnn ... -5'

    (6)  NN ... NGGGACNN      nnnnnnnnnnnnnnnnnn ... -3'
         NN ... NCCCTGp     nnnnnnnnnnnnnnnnnnnn ... -5'

where the cleavage sites are indicated by double underlining. Cleavage adaptors A4, A5, and
A6 are preferably applied as mixtures, such that every possible two-nucleotide protruding
strand is represented.
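
The tag counts quoted above (768, 1536, and 24 mixtures of 64) can be checked with a short calculation. The sketch below assumes, as an interpretation rather than something the text states outright, that one encoded adaptor and hence one tag is needed for every possible four-nucleotide protruding strand produced by each cleavage adaptor. The function names are illustrative.

```python
def adaptors_per_cleavage(protruding_length):
    """Number of distinct protruding strands of the given length (4 bases each)."""
    return 4 ** protruding_length


def total_tags(num_cleavage_adaptors, protruding_length=4):
    """Assumed model: one full set of encoded adaptors per cleavage adaptor."""
    return num_cleavage_adaptors * adaptors_per_cleavage(protruding_length)


def tags_per_mixture(total, num_mixtures):
    """Split the tag complements evenly over the stated number of mixtures."""
    if total % num_mixtures != 0:
        raise ValueError("tags do not divide evenly into mixtures")
    return total // num_mixtures


print(total_tags(3))                        # 768
print(total_tags(6))                        # 1536
print(tags_per_mixture(total_tags(6), 24))  # 64
```

Under the same assumption, the three-adaptor case corresponds to 12 mixtures of 64 tag complements.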

Once the encoded adaptors have been ligated, the target polynucleotides are prepared
for loading onto solid phase supports, preferably microparticles, as disclosed in Brenner,
International patent application PCT/US95/12791 (WO 96/12041). Briefly, the
oligonucleotide tags for sorting are rendered single stranded using a "stripping" reaction with
T4 DNA polymerase, e.g. Kuijper et al (cited above). The single stranded oligonucleotide
tags are specifically hybridized and ligated to their tag complements on microparticles. The
loaded microparticles are then analyzed in an instrument, such as described in Brenner (cited
above), which permits the sequential delivery, specific hybridization, and removal of labeled
tag complements to encoded adaptors.

In embodiments where encoded adaptors are ligated to a target polynucleotide (or
population of target polynucleotides) only one time, there are several non-enzymatic
template-driven methods of ligation that may be used in accordance with the invention. Such
ligation methods include, but are not limited to, those disclosed in Shabarova, Biochimie, 70:
1323-1334 (1988); Dolinnaya et al, Nucleic Acids Research, 16: 3721-3738 (1988); Letsinger
et al, U.S. patent 5,476,930; Gryaznov et al, Nucleic Acids Research, 22: 2366-2369 (1994);
Kang et al, Nucleic Acids Research, 23: 2344-2345 (1995); Gryaznov et al, Nucleic Acids
Research, 21: 1403-1408 (1993); Gryaznov, U.S. patent 5,571,677; and like references.
Preferably, non-enzymatic ligation is carried out by the method of Letsinger et al (cited
above). In this method, an encoded adaptor having a 3'-bromoacetylated end is reacted with a
polynucleotide having a complementary protruding strand and a thiophosphoryl group at its 5'
end. An exemplary encoded adaptor employing such chemistry has the following structure:

     BrCH2(=O)CNH-(B)r(B)s(B)q(B)t-3'
    3'-zB'B'B'B'B'(B')r(B')s(B')q-5'

where B and B' are nucleotides and their complements, and z, r, s, q, and t are as described
below. Br, C, H, and N have their usual chemical meanings. As explained in the above
references, in a template-driven reaction, the 3'-bromoacetylated oligonucleotide reacts
spontaneously with an oligonucleotide having a 5'-thiophosphoryl group under aqueous
conditions to form a thiophosphorylacetylamino linkage. A thiophosphoryl group is readily
attached to the 5' hydroxyl of a target polynucleotide by treatment with T4 kinase in the
presence of adenosine-5'-O-(1-thiotriphosphate), i.e. gamma-S-ATP, as described in Kang et al
(cited above).


Sequence Analysis with Cycles of Ligation and Cleavage
Encoded adaptors may be used in an adaptor-based method of DNA sequencing that
includes repeated cycles of ligation, identification, and cleavage, such as the method described
in Brenner, U.S. patent 5,599,675 and PCT Publication No. WO 95/27080. Briefly, such a
method comprises the following steps: (a) ligating an encoded adaptor to an end of a
polynucleotide, the encoded adaptor having a nuclease recognition site of a nuclease whose
cleavage site is separate from its recognition site; (b) identifying one or more nucleotides at
the end of the polynucleotide by the identity of the encoded adaptor ligated thereto; (c)
cleaving the polynucleotide with a nuclease recognizing the nuclease recognition site of the
encoded adaptor such that the polynucleotide is shortened by one or more nucleotides; and (d)
repeating said steps (a) through (c) until said nucleotide sequence of the polynucleotide is
determined. In the identification step, successive sets of tag complements are specifically
hybridized to the respective tags carried by encoded adaptors ligated to the ends of the target
polynucleotides, as described above. The type and sequence of nucleotides in the protruding
strands of the polynucleotides are identified by the label carried by the specifically hybridized
tag complement and the set from which the tag complement came, as described above.
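
The logic of steps (a) through (d) can be sketched as a simple loop. The Python below is an idealized, error-free toy model and not the method of the cited Brenner references: ligation and decoding are reduced to reading the terminal characters of a string, and cleavage to slicing them off. The function name and the default of four nucleotides per cycle are assumptions for illustration.

```python
def read_by_cycles(target, nucleotides_per_cycle=4):
    """Toy model of repeated ligation, identification, and cleavage.

    Returns the list of portions identified, one per cycle.
    """
    remaining = target
    portions = []
    while remaining:
        # (a) + (b): the ligated encoded adaptor reports the terminal nucleotides.
        portion = remaining[:nucleotides_per_cycle]
        portions.append(portion)
        # (c): cleavage shortens the polynucleotide by the identified portion.
        remaining = remaining[len(portion):]
        # (d): repeat until nothing is left to read.
    return portions


portions = read_by_cycles("GATTACAGATTACA")
print(portions)           # ['GATT', 'ACAG', 'ATTA', 'CA']
print("".join(portions))  # GATTACAGATTACA
```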
Oligonucleotide Tags and Tag Complements
Oligonucleotide tags are employed for two different purposes in the preferred [...]
of the invention. Oligonucleotide tags are employed as described in Brenner, [...]
PCT/US96/09513 (WO [...]), to sort large numbers of polynucleotides, e.g. several [...]
thousand, [...] of a few tens to a few thousand. For the former use, large numbers, or
repertoires, of tags are [...] required, and therefore synthesis of individual [...]
extremely large repertoires of tags are [...] as for delivering labels [...] encoded adaptors,
oligonucleotide tags of a minimally cross-hybridizing set may be separately
synthesized, as well as synthesized combinatorially.

As described in Brenner (cited above), the nucleotide sequences of oligonucleotides
of a minimally cross-hybridizing set are conveniently enumerated by [...] computer
programs, such as those exemplified by the programs whose source codes are listed in
Appendices I and II. Similar computer programs are readily written for listing
oligonucleotides of minimally cross-hybridizing sets for any embodiment of the invention.
Table I below provides guidance as to the size of sets of minimally cross-hybridizing
[...] computer programs were used to generate the numbers.
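
For orientation, the kind of enumeration described above can be sketched with a simple greedy search. This is not the program of Appendix I or II, whose source is not reproduced in this section; it is a minimal illustration that assumes "nucleotide difference" means the number of mismatched positions between two equal-length words (Hamming distance). A greedy pass is not guaranteed to find a set of maximal size.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_min_cross_hybridizing(length, min_diff, alphabet="ACGT"):
    """Greedily collect words that differ from every chosen word in >= min_diff positions.

    Words are visited in lexicographic order; the result depends on that
    order and need not be the largest possible set.
    """
    chosen = []
    for word in map("".join, product(alphabet, repeat=length)):
        if all(hamming(word, w) >= min_diff for w in chosen):
            chosen.append(word)
    return chosen


words = greedy_min_cross_hybridizing(4, 3)
print(len(words), words[:4])
```

Restricting `alphabet` to three letters, e.g. "AGT", gives words of the kind used where one nucleotide type is deliberately left out of the tags.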



                                      Table I

                     Minimally Cross-Hybridizing Sets of Words
                          Consisting of Four Nucleotides

Oligonucleotide   Nucleotide Difference      Maximal Size of    Size of           Size of
Word Length       between Oligonucleotides   Minimally Cross-   Repertoire with   Repertoire with
                  of Minimally Cross-        Hybridizing Set    Three Words       Four Words
                  Hybridizing Set

      4                     3                      11           1331              14,641
      6                     4                      25           15,625            3.9 x 10^5
      6                     5                       4           64                256
      8                     4                     225           1.14 x 10^7
      8                     5                      56           1.75 x 10^5
      8                     6                      17           4913
     12                     8                      62
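
The repertoire columns of Table I are consistent with raising the set size to the number of words per tag, which is what one would expect if each word position of a tag can independently hold any word of the set. The sketch below checks the listed entries under that assumption; the table data are copied from Table I, and the function name is illustrative.

```python
# (word length, nucleotide difference, maximal set size), as listed in Table I
TABLE_I = [
    (4, 3, 11),
    (6, 4, 25),
    (6, 5, 4),
    (8, 4, 225),
    (8, 5, 56),
    (8, 6, 17),
    (12, 8, 62),
]


def repertoire_size(set_size, words_per_tag):
    """Number of distinct tags when each position holds any word of the set."""
    return set_size ** words_per_tag


for length, diff, size in TABLE_I:
    print(length, diff, size, repertoire_size(size, 3), repertoire_size(size, 4))
```

For example, 11 words give 11^3 = 1331 three-word tags and 11^4 = 14,641 four-word tags, matching the first row.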

Sets containing several hundred to several [...], or even several [...] of [...]
of parallel synthesis a[...], as disclosed in Frank et al, U.S. [...]; [...] Nucleic Acids
Research, [...] (1983); Matson et al, Anal. [...]; [...] et al, Proc. Natl. Acad. [...];
Southern et al, J. Biotechnology, 35: 217-227 (1994); [...] International application [...].

[...] melting temperature [...] matched tag complements to be more readily [...]
matched tag [...] subunit in the set. Guidance for [...] is provided by publ[...]
Nucleic Acids [...], 17: 8543-[...] (1989) and [...]: 6409-[...]; [...] et al, Proc. Natl.
Acad. Sci., [...]: 3746-3750 (198[.]); Wetmur, Crit. Rev. [...], 26: 227-259 (1991); [...].
[...] numbers of oligonucleotide tags are required, such as for delivering labels [...], the
computer programs of [...] to generate and list [...] cross-hybridizing sets of [...] that
are used directly (i.e. without concatenation into [...]). Such lists can be further [...],
such as [...] energies of all [...] subunits are [...] same [...]
[...] of terminal nucleotides, shown in italic below, may [...] in multi-subunit tags, a
word [...] the form:

[formula illegible in source]

[...] With ends of tags always forming perfectly matched duplexes, [...] ends. It is well
known that duplexes with [...]

[...] cross-hybridizing [...] subunits are made up of three of the four [...] nucleotide in
the oligonucleotide tags permits target p[...] to be loaded onto solid phase supports [...].
The following is an [...] selected from the group consisting of A, G, and T.
                        Table II

    Word:        w1      w2      w3      w4
    Sequence:    GATT    TGAT    TAGA    TTTG

    Word:        w5      w6      w7      w8
    Sequence:    GTAA    AGTA    ATGT    AAAG

[...] nucleotides are employed.
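
The defining property of the Table II words can be verified directly. The sketch below, which assumes that "difference" is counted as mismatched positions between words, confirms that every pair of words differs in at least three of the four positions and that only A, G, and T are used.

```python
from itertools import combinations

# Words w1 through w8 as listed in Table II.
WORDS = ["GATT", "TGAT", "TAGA", "TTTG", "GTAA", "AGTA", "ATGT", "AAAG"]


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


min_diff = min(hamming(a, b) for a, b in combinations(WORDS, 2))
uses_only_agt = all(set(word) <= set("AGT") for word in WORDS)

print(min_diff)       # 3
print(uses_only_agt)  # True
```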
The oligonucleotide tags of the invention and their complements are conveniently
synthesized on an automated DNA synthesizer, e.g. an Applied Biosystems, Inc. (Foster City,
California) model 392 or 394 DNA/RNA Synthesizer, using standard chemistries, such as
phosphoramidite chemistry, e.g. disclosed in the following references: Beaucage and Iyer,
Tetrahedron, 48: 2223-2311 (1992); Molko et al, U.S. patent 4,980,460; Koster et al, U.S.
patent 4,725,677; Caruthers et al, U.S. patents 4,415,732; 4,458,066; and 4,973,679; and the
like. Alternative chemistries, e.g. resulting in non-natural backbone groups, such as peptide
nucleic acids (PNAs), N3'->P5' phosphoramidates, and the like, may also be employed. In
some embodiments, tags may comprise naturally occurring nucleotides that permit processing
or manipulation by enzymes, while the corresponding tag complements may comprise
non-natural nucleotide analogs, such as peptide nucleic acids, or like compounds, that promote
the formation of more stable duplexes during sorting. In the case of tags used for delivering
label to encoded adaptors, both the oligonucleotide tags and tag complements may be
constructed from non-natural nucleotides, or analogs, provided ligation can take place, either
chemically or enzymatically.

Double stranded forms of tags may be made by separately synthesizing the
complementary strands followed by mixing under conditions that permit duplex formation.
Alternatively, double stranded tags may be formed by first synthesizing a single stranded
repertoire linked to a known oligonucleotide sequence that serves as a primer binding site.
The second strand is then synthesized by combining the single stranded repertoire with a
primer and extending with a polymerase. This latter approach is described in Oliphant et al,
Gene, 44: 177-183 (1986). Such duplex tags may then be inserted into cloning vectors along
with target polynucleotides for sorting and manipulation of the target polynucleotide in
accordance with the invention.

When tag complements are employed that are made up of nucleotides that have
enhanced binding characteristics, such as PNAs or oligonucleotide N3'->P5'
phosphoramidates, sorting can be implemented through the formation of D-loops between tags
comprising natural nucleotides and their PNA or phosphoramidate complements, as an
alternative to the "stripping" reaction employing the 3'->5' exonuclease activity of a DNA
polymerase to render a tag single stranded.

Oligonucleotide tags for sorting may range in length from 12 to 60 nucleotides or
basepairs. Preferably, oligonucleotide tags range in length from 18 to 40 nucleotides or
basepairs. More preferably, oligonucleotide tags range in length from 25 to 40 nucleotides or
basepairs. In terms of preferred and more preferred numbers of subunits, these ranges may
be expressed as follows:
                              Table III

        Numbers of Subunits in Tags in Preferred Embodiments

    Monomers
    in Subunit          Nucleotides in Oligonucleotide Tag

                   (12-60)            (18-40)            (25-40)

        3          4-20 subunits      6-13 subunits      8-13 subunits
        4          3-15 subunits      4-10 subunits      6-10 subunits
        5          2-12 subunits      3-8 subunits       5-8 subunits
        6          2-10 subunits      3-6 subunits       4-6 subunits

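
The entries of Table III can be reproduced by dividing each tag-length bound by the subunit size and discarding the remainder. The text does not state this rule; it is an observation that matches all twelve entries, and the sketch below should be read in that light. The names are illustrative.

```python
# Tag length ranges (nucleotides) from the column headings of Table III.
LENGTH_RANGES = [(12, 60), (18, 40), (25, 40)]


def subunit_range(min_len, max_len, monomers_per_subunit):
    """Subunit counts obtained by integer division of both length bounds."""
    return (min_len // monomers_per_subunit, max_len // monomers_per_subunit)


for monomers in (3, 4, 5, 6):
    row = [subunit_range(lo, hi, monomers) for lo, hi in LENGTH_RANGES]
    print(monomers, row)
# 3 [(4, 20), (6, 13), (8, 13)]
# 4 [(3, 15), (4, 10), (6, 10)]
# 5 [(2, 12), (3, 8), (5, 8)]
# 6 [(2, 10), (3, 6), (4, 6)]
```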

Most preferably, oligonucleotide tags for sorting are single stranded and specific
hybridization occurs via Watson-Crick pairing with a tag complement.

Preferably, repertoires of single stranded oligonucleotide tags for sorting contain at
least 100 members; more preferably, repertoires of such tags contain at least 1000 members;
and most preferably, repertoires of such tags contain at least 10,000 members.

Preferably, repertoires of tag complements for delivering labels contain at least 16
members; more preferably, repertoires of such tags contain at least 64 members. Still more
preferably, such repertoires of tag complements contain from 16 to 1024 members, e.g. a
number for identifying nucleotides in protruding strands of from 2 to 5 nucleotides in length.
Most preferably, such repertoires of tag complements contain from 64 to 256 members.
Repertoires of desired sizes are selected by directly generating sets of words, or subunits, of
the desired size, e.g. with the help of the computer programs of Appendices I and II, or
repertoires are formed by generating a set of words which are then used in a combinatorial
synthesis scheme to give a repertoire of the desired size. Preferably, the length of single
stranded tag complements for delivering labels is between 8 and 20 nucleotides. More
preferably, the length is between 9 and 15 nucleotides.

Triplex Tags 

In embodiments where specific hybridization occurs via triplex formation, coding of
tag sequences follows the same principles as for duplex-forming tags; however, there are
further constraints on the selection of subunit sequences. Generally, third strand association
via Hoogsteen type of binding is most stable along homopyrimidine-homopurine tracks in a
double stranded target. Usually, base triplets form in T-A*T or C-G*C motifs (where "-"
indicates Watson-Crick pairing and "*" indicates Hoogsteen type of binding); however, other
motifs are also possible. For example, Hoogsteen base pairing permits parallel and
antiparallel orientations between the third strand (the Hoogsteen strand) and the purine-rich
strand of the duplex to which the third strand binds, depending on conditions and the
composition of the strands. There is extensive guidance in the literature for selecting
appropriate sequences, orientation, conditions, nucleoside type (e.g. whether ribose or
deoxyribose nucleosides are employed), base modifications (e.g. methylated cytosine, and the
like) in order to maximize, or otherwise regulate, triplex stability as desired in particular
embodiments, e.g. Roberts et al, Proc. Natl. Acad. Sci., 88: 9397-9401 (1991); Roberts et al,
Science, 258: 1463-1466 (1992); Roberts et al, Proc. Natl. Acad. Sci., 93: 4320-4325
(1996); Distefano et al, Proc. Natl. Acad. Sci., 90: 1179-1183 (1993); Mergny et al,
Biochemistry, 30: 9791-9798 (1991); Cheng et al, J. Am. Chem. Soc., 114: 4465-4474
(1992); Beal and Dervan, Nucleic Acids Research, 20: 2773-2776 (1992); Beal and Dervan,
J. Am. Chem. Soc., 114: 4976-4982 (1992); Giovannangeli et al, Proc. Natl. Acad. Sci., 89:
8631-8635 (1992); Moser and Dervan, Science, 238: 645-650 (1987); McShan et al, J. Biol.
Chem., 267: 5712-5721 (1992); Yoon et al, Proc. Natl. Acad. Sci., 89: 3840-3844 (1992);
Blume et al, Nucleic Acids Research, 20: 1777-1784 (1992); Thuong and Helene, Angew.
Chem. Int. Ed. Engl., 32: 666-690 (1993); Escude et al, Proc. Natl. Acad. Sci., 93: 4365-4369
(1996); and the like. Conditions for annealing single-stranded or duplex tags to their
single-stranded or duplex complements are well known, e.g. Ji et al, Anal. Chem., 65:
1323-1328 (1993); Cantor et al, U.S. patent 5,482,836; and the like. Use of triplex tags in
sorting has the advantage of not requiring a "stripping" reaction with polymerase to expose the
tag for annealing to its complement.

Preferably, oligonucleotide tags of the invention employing triplex hybridization are
double stranded DNA and the corresponding tag complements are single stranded. More
preferably, 5-methylcytosine is used in place of cytosine in the tag complements in order to
broaden the range of pH stability of the triplex formed between a tag and its complement.
Preferred conditions for forming triplexes are fully disclosed in the above references. Briefly,
hybridization takes place in concentrated salt solution, e.g. 1.0 M NaCl, 1.0 M potassium
acetate, or the like, at pH below 5.5 (or 6.5 if 5-methylcytosine is employed). Hybridization
temperature depends on the length and composition of the tag; however, for an 18-20-mer tag
or longer, hybridization at room temperature is adequate. Washes may be conducted with less
concentrated salt solutions, e.g. 10 mM sodium acetate, 100 mM MgCl2, pH 5.8, at room
temperature. Tags may be eluted from their tag complements by incubation in a similar salt
solution at pH 9.0.

Minimally cross-hybridizing sets of oligonucleotide tags that form triplexes may be
generated by the computer program of Appendix II, or similar programs. An exemplary set
of double stranded 8-mer words is listed below in capital letters with the corresponding
complements in small letters. Each such word differs from each of the other words in the set
by at least three base pairs.
                              Table IV

                Exemplary Minimally Cross-Hybridizing
                 Set of Double Stranded 8-mer Tags

    5'-AAGGAGAG    5'-AAAGGGGA    5'-AGAGAAGA    5'-AGGGGGGG
    3'-TTCCTCTC    3'-TTTCCCCT    3'-TCTCTTCT    3'-TCCCCCCC
    3'-ttcctctc    3'-tttcccct    3'-tctcttct    3'-tccccccc

    5'-AAAAAAAA    5'-AAGAGAGA    5'-AGGAAAAG    5'-GAAAGGAG
    3'-TTTTTTTT    3'-TTCTCTCT    3'-TCCTTTTC    3'-CTTTCCTC
    3'-tttttttt    3'-ttctctct    3'-tccttttc    3'-ctttcctc

    5'-AAAAAGGG    5'-AGAAGAGG    5'-AGGAAGGA    5'-GAAGAAGG
    3'-TTTTTCCC    3'-TCTTCTCC    3'-TCCTTCCT    3'-CTTCTTCC
    3'-tttttccc    3'-tcttctcc    3'-tccttcct    3'-cttcttcc

    5'-AAAGGAAG    5'-AGAAGGAA    5'-AGGGGAAA    5'-GAAGAGAA
    3'-TTTCCTTC    3'-TCTTCCTT    3'-TCCCCTTT    3'-CTTCTCTT
    3'-tttccttc    3'-tcttcctt    3'-tccccttt    3'-cttctctt



                                      Table V

                Repertoire Size of Various Double Stranded Tags
                 That Form Triplexes with Their Tag Complements

Oligonucleotide   Nucleotide Difference      Maximal Size of    Size of           Size of
Word Length       between Oligonucleotides   Minimally Cross-   Repertoire with   Repertoire with
                  of Minimally Cross-        Hybridizing Set    Four Words        Five Words
                  Hybridizing Set

      4                     2                       8           4096              3.2 x 10^4
      6                     3                       8           4096              3.2 x 10^4
      8                     3                      16           6.5 x 10^4        1.05 x 10^6
     10                     5                       8           4096
     15                     5                      92
     20                     6                     768
     20                     7                     484
     20                     8                     189
     20                     9                      30

Synthesis and Structure of Adaptors
The encoded adaptors and cleavage adaptors are conveniently synthesized on
automated DNA synthesizers using standard chemistries, such as phosphoramidite chemistry,
e.g. disclosed in the following references: Beaucage and Iyer, Tetrahedron, 48: 2223-2311
(1992); Molko et al, U.S. patent 4,980,460; Koster et al, U.S. patent 4,725,677; Caruthers et
al, U.S. patents 4,415,732; 4,458,066; and 4,973,679; and the like. Alternative chemistries,
e.g. resulting in non-natural backbone groups, such as phosphorothioate, phosphoramidate,
and the like, may also be employed provided that the resulting oligonucleotides are compatible
with the ligation and/or cleavage reagents used in a particular embodiment. Typically, after
synthesis of complementary strands, the strands are combined to form a double stranded
adaptor. The protruding strand of an encoded adaptor may be synthesized as a mixture, such
that every possible sequence is represented in the protruding portion. Such mixtures are
readily synthesized using well known techniques, e.g. as disclosed in Telenius et al,
Genomics, 13: 718-725 (1992); Welsh et al, Nucleic Acids Research, 19: 5275-5279 (1991);
Grothues et al, Nucleic Acids Research, 21: 1321-1322 (1993); Hartley, European patent
application 90304496.4 (EP Publication No. 395398); and the like. Generally, these
techniques simply call for the application of mixtures of the activated monomers to the
growing oligonucleotide during the coupling steps where one desires to introduce multiple
nucleotides. As discussed above, in some embodiments it may be desirable to reduce the
complexity of the adaptors. This can be accomplished using complexity reducing analogs,
such as deoxyinosine, 2-aminopurine, or the like, e.g. as taught in Kong Thoo Lin et al,
Nucleic Acids Research, 20: 5149-5152, or by U.S. patent 5,002,867; Nichols et al, Nature,
369: 492-493 (1994); and the like.

In some embodiments, it may be desirable to synthesize encoded adaptors or cleavage
adaptors as a single polynucleotide which contains self-complementary regions. After
synthesis, the self-complementary regions are allowed to anneal to form an adaptor with a
protruding strand at one end and a single stranded loop at the other end. Preferably, in such
embodiments the loop region may comprise from about 3 to 10 nucleotides, or other
comparable linking moieties, e.g. alkylether groups, such as disclosed in U.S. patent
4,914,210. Many techniques are available for attaching reactive groups to the bases or
internucleoside linkages for labeling, as discussed in the references cited below.

When conventional ligases are employed in the invention, as described more fully
below, the 5' end of the adaptor may be phosphorylated in some embodiments. A 5'
monophosphate can be attached to a second oligonucleotide either chemically or
enzymatically with a kinase, e.g. Sambrook et al, Molecular Cloning: A Laboratory Manual,
2nd Edition (Cold Spring Harbor Laboratory, New York, 1989). Chemical phosphorylation
is described by Horn and Urdea, Tetrahedron Lett., 27: 4705 (1986), and reagents for
carrying out the disclosed protocols are commercially available, e.g. 5' Phosphate-ON™
from Clontech Laboratories (Palo Alto, California).

Encoded adaptors of the invention can have several embodiments depending, for
example, on whether single or double stranded tags are used, whether multiple tags are used,
whether a 5' protruding strand or 3' protruding strand is employed, whether a 3' blocking
group is used, and the like. Formulas for several embodiments of encoded adaptors are
shown below. Preferred structures for encoded adaptors using one single stranded tag are as
follows:



    5'-p(N)n(N)r(N)s(N)q(N)t-3'
           z(N')r(N')s(N')q-5'

or

           p(N)r(N)s(N)q(N)t-3'
    3'-z(N)n(N')r(N')s(N')q-5'

where N is a nucleotide and N' is its complement, p is a phosphate group, z is a 3' hydroxyl or
a 3' blocking group, n is an integer between 2 and 6, inclusive, r is an integer greater than or
equal to 0, s is an integer which is either between four and six whenever the encoded adaptor
has a nuclease recognition site or is 0 whenever there is no nuclease recognition site, q is an
integer greater than or equal to 0, and t is an integer between 8 and 20, inclusive. More
preferably, n is 4 or 5, and t is between 9 and 15, inclusive. Whenever an encoded adaptor
contains a nuclease recognition site, the region of "r" nucleotide pairs is selected so that a
predetermined number of nucleotides are cleaved from a target polynucleotide whenever the
nuclease recognizing the site is applied. The size of "r" in a particular embodiment depends
on the reach of the nuclease (as the term is defined in U.S. patent 5,599,675 and WO
95/27080) and the number of nucleotides sought to be cleaved from the target polynucleotide.
Preferably, r is between 0 and 20; more preferably, r is between 0 and 12. The region of "q"
nucleotide pairs is a spacer segment between the nuclease recognition site and the tag region
of the encoded probe. The region of "q" nucleotide pairs may include further nuclease
recognition sites, labelling or signal generating moieties, or the like. The single stranded
oligonucleotide of "t" nucleotides is a "t-mer" oligonucleotide tag selected from a minimally
cross-hybridizing set.
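
The constraints on n, r, s, q, and t can be collected into a small checker. The Python sketch below is illustrative only: the class and method names are hypothetical, it encodes just the broad ranges stated above (not the "more preferably" ranges), and it takes the tag-bearing strand length of the single-tag structure to be n + r + s + q + t, as read from the first formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedAdaptorShape:
    """Segment lengths of an encoded adaptor with one single stranded tag."""
    n: int  # protruding strand, the unpaired (N)n segment
    r: int  # pairs between protruding strand and recognition site
    s: int  # nuclease recognition site (0 if absent)
    q: int  # spacer pairs between recognition site and tag
    t: int  # single stranded tag

    def problems(self):
        """Return a list of violated constraints (empty if none)."""
        issues = []
        if not 2 <= self.n <= 6:
            issues.append("n must be between 2 and 6, inclusive")
        if self.r < 0:
            issues.append("r must be >= 0")
        if self.s != 0 and not 4 <= self.s <= 6:
            issues.append("s must be 0 or between 4 and 6")
        if self.q < 0:
            issues.append("q must be >= 0")
        if not 8 <= self.t <= 20:
            issues.append("t must be between 8 and 20, inclusive")
        return issues

    def tag_strand_length(self):
        """Length of the strand written 5'-p(N)n(N)r(N)s(N)q(N)t-3'."""
        return self.n + self.r + self.s + self.q + self.t


shape = EncodedAdaptorShape(n=4, r=8, s=5, q=2, t=12)
print(shape.problems())           # []
print(shape.tag_strand_length())  # 31
```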

The 3' blocking group "z" may have a variety of forms and may include almost any
chemical entity that precludes ligation and that does not interfere with other steps of the
method, e.g. removal of the 3' blocked strand, ligation, or the like. Exemplary 3' blocking
groups include, but are not limited to, hydrogen (i.e. 3' deoxy), phosphate, phosphorothioate,
acetyl, and the like. Preferably, the 3' blocking group is a phosphate because of the
convenience in adding the group during the synthesis of the 3' blocked strand and the
convenience in removing the group with a phosphatase to render the strand capable of ligation
with a ligase. An oligonucleotide having a 3' phosphate may be synthesized using the
protocol described in chapter 12 of Eckstein, Editor, Oligonucleotides and Analogues: A
Practical Approach (IRL Press, Oxford, 1991).

Further 3' blocking groups are available from the chemistries developed for
reversible chain terminating nucleotides in base-by-base sequencing schemes, e.g. disclosed in
the following references: Cheeseman, U.S. patent 5,302,509; Tsien et al, International
application WO 91/06678; Canard et al, Gene, 148: 1-6 (1994); and Metzker et al, Nucleic
Acids Research, 22: 4259-4267 (1994). Roughly, these chemistries permit the chemical or
enzymatic removal of specific blocking groups (usually having an appendent label) to
generate a free hydroxyl at the 3' end of a priming strand.

Preferably, when z is a 3' blocking group, it is a phosphate group and the double
stranded portion of the adaptors contains a nuclease recognition site of a nuclease whose
recognition site is separate from its cleavage site.

When double stranded oligonucleotide tags are employed that specifically hybridize
with single stranded tag complements to form triplex structures, encoded tags of the invention
preferably have the following form:

    5'-p(N)n(N)r(N)s(N)q(N)t-3'
           z(N')r(N')s(N')q(N')t-5'

or

           p(N)r(N)s(N)q(N)t-3'
    3'-z(N)n(N')r(N')s(N')q(N')t-5'

where N, N', p, q, r, s, z, and n are defined as above. Preferably, in this embodiment t is an
integer in the range of 12 to 24.

Clearly, there are additional structures which contain elements of the basic designs set
forth above that would be apparent to those with skill in the art. For example, encoded
adaptors of the invention include embodiments with multiple tags, such as the following:

    5'-p(N)n(N)r(N)s(N)q(N)t1 ... (N)tk-3'
           z(N')r(N')s(N')q(N')t1 ... (N')tk-5'

or

           p(N)r(N)s(N)q(N)t1 ... (N)tk-3'
    3'-z(N)n(N')r(N')s(N')q(N')t1 ... (N')tk-5'

where the encoded adaptor includes k double stranded tags. Preferably, t1 = t2 = ... = tk and
k is either 1, 2, or 3.


Labeling Tag Complements

The tag complements of [...] can be labeled in a variety of ways for [...] of radioactive
[...], fluorescent moieties, colorimetric moieties, chemiluminescent moieties, [...] for
labeling DNA and constructing DNA adaptors [...] Fluorescent Probes and [...] Chemicals
([...], Eugene, [...]); [...] DNA Probes, 2nd Edition ([...] Press, New York, [...]); [...]
Oligonucleotides and Analogues: A Practical Approach (IRL Press, Oxford, 1991); and
the like. Many more particular methodologies applicable to the invention are disclosed in the
following sample of references: Fung et al, U.S. patent 4,757,141; [...] patent 5,151,507;
Cr[...], U.S. patent [...] (synthesis of [...] reporter groups); Jablonski [...]; Ju et al, Nature
Medicine, 2: 246-249 (1996); and Urdea et al, U.S. patent 5,124,246 (branched DNA).
Attachment sites [...]

[...] one or more fluorescent dyes are used as labels for tag complements, e.g.
as disclosed by Menchen et al, U.S. patent 5,188,934; [...] et al, PCT a[...]
moiety" means a signaling means which conveys [...]
absorption and/or emission properties of one or more molecules. Such fluorescent properties
include fluorescence intensity, fluorescence life time, emission spectrum characteristics,
energy transfer, and the like.

Ligating Adaptors and Preventing Self-Ligation
In accordance with the preferred embodiment of the invention, [...] are
ligated to the ends of target polynucleotides to prepare such ends for [...] ligation [...].
Preferably, [...] using a ligase [...] protocol. Many ligases are known and are suitable for
use in the invention, e.g. Lehman, Science, 186: 790-797 (1974); [...] DNA [...] The Enzymes,
Vol. [...]; [...]
ligases include T4 DNA ligase, T7 DNA ligase, E. coli DNA ligase, Taq ligase, Pfu ligase,
and Tth ligase. Protocols for their use are well known, e.g. Sambrook et al (cited above);
Barany, PCR Methods and Applications, 1: 5-16 (1991); Marsh et al, Strategies, 5: 73-76
(1992); and the like. Generally, ligases require that a 5' phosphate group be present for
ligation to the 3' hydroxyl of an abutting strand. This is conveniently provided for at least one
strand of the target polynucleotide by selecting a nuclease which leaves a 5' phosphate, e.g.
FokI.

A special problem may arise in dealing with either polynucleotide ends or adaptors
that are capable of self-ligation, such as illustrated in Figure 2, where the four-nucleotide
protruding strands of the anchored polynucleotides are complementary to one another (114).
This problem is especially severe in embodiments where the polynucleotides (112) to be
analyzed are presented to the adaptors as uniform populations of identical polynucleotides
attached to a solid phase support (110). In these situations, the free ends of the anchored
polynucleotides can twist around to form perfectly matched duplexes (116) with one another.
If the 5' strands of the ends are phosphorylated, the polynucleotides are readily ligated in the
presence of a ligase. An analogous problem also exists for double stranded adaptors.
Namely, whenever their 5' strands are phosphorylated, the 5' strand of one adaptor may be
ligated to the free 3' hydroxyl of another adaptor whenever the nucleotide sequences of their
protruding strands are complementary. When self-ligation occurs, the protruding strands of
neither the adaptors nor the target polynucleotides are available for analysis or processing.
This, in turn, leads to the loss or disappearance of signals generated in response to correct
ligations of adaptors to target polynucleotides. Since the probability of a palindromic 4-mer
occurring in a random sequence is the same as the probability of a repeated pair of
nucleotides (6.25%), adaptor-based methods for de novo sequencing have a high expectation
of failure after a few cycles because of self-ligation. When this occurs, further analysis of the
polynucleotide becomes impossible.
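
The 6.25% figure quoted above can be checked by direct enumeration. The following is a minimal sketch, assuming that all four bases are equally likely and independent at each position; the function names and the survival estimate (which further assumes that each cycle exposes an independent random 4-mer) are illustrative assumptions and are not part of the disclosure.

```python
from itertools import product

# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_self_complementary(overhang):
    """True if a protruding strand can pair with an identical copy of itself."""
    return overhang == reverse_complement(overhang)


# Enumerate all 4^4 = 256 possible four-nucleotide protruding strands.
all_4mers = ["".join(p) for p in product("ACGT", repeat=4)]
palindromes = [s for s in all_4mers if is_self_complementary(s)]
palindrome_fraction = len(palindromes) / len(all_4mers)


def chance_of_no_palindrome(cycles, p=palindrome_fraction):
    """Chance that none of `cycles` independent random overhangs is palindromic."""
    return (1.0 - p) ** cycles


print(len(palindromes), palindrome_fraction)
print(chance_of_no_palindrome(10))
```

Under these assumptions the chance of avoiding a self-complementary end falls with every additional cycle, which is the motivation for the blocking and dephosphorylation steps described next.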

The above problems may be addressed by implementing the invention in the following
steps, which are illustrated in Figure 3A for a preferred embodiment: (a) ligating (120) an
encoded adaptor to an end of the polynucleotide (122), where the end of the polynucleotide
has a dephosphorylated 5' hydroxyl, and the end of the encoded adaptor (124) to be ligated
has a first strand (126) and a second strand (128), and the second strand of the encoded
adaptor has a 3' blocking group (130); (b) removing the 3' blocking group of the second
strand after ligation, e.g. by washing (132), or by enzymatically or chemically removing the
group in situ, e.g. by treatment with a phosphatase if the blocking group is a phosphate; (c)
phosphorylating (134) the 5' hydroxyl of the polynucleotide; (d) ligating (136) a second strand
(142) having an unblocked 3' moiety to regenerate the encoded adaptor (138); and (e)
identifying (144) one or more nucleotides at the end of the polynucleotide by the identity of
the encoded adaptor ligated thereto, e.g. via a fluorescently labelled (140) tag complement.
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The encoded adaptors and target polynucleotides may be combined for ligation either singly 
or as mixtures. For example, a single kind of adaptor having a defined sequence may be
combined with a single kind of polynucleotide having a common (and perhaps, unknown)
nucleotide sequence; or a single kind of adaptor having a defined sequence may be combined
with a mixture of polynucleotides, such as a plurality of uniform populations of identical
polynucleotides attached to different solid phase supports in the same reaction vessel, e.g.
as described by Brenner et al, International application PCT/US96/09513 (WO 96/41011); or a
mixture of encoded adaptors, particularly mixtures having different nucleotide sequences in
their protruding strands, may be combined with a single kind of polynucleotide; or a mixture
of encoded adaptors may be combined with a mixture of polynucleotides. When the term
"adaptor," or "encoded adaptor," is used in the singular, it is meant to encompass mixtures of
adaptors having different sequences of protruding strands as well as a single kind of adaptor
having the same sequence of protruding strand, in a manner analogous to the usage of the
term "probe."

Besides removing by melting, a 3' deoxy may be removed from a second strand by a
polymerase "exchange" reaction disclosed in Kuijper et al, Gene, 112: 147-155 (1992);
Aslanidis et al, Nucleic Acids Research, 18: 6069-6074 (1990); and like references. Briefly,
the 5'->3' exonuclease activity of T4 DNA polymerase, and like enzymes, may be used to
exchange nucleotides in a priming strand with their triphosphate counterparts in solution, e.g.
Kuijper et al (cited above). Thus, with such a reaction a 3' dideoxynucleotide can be
exchanged with a 2'-deoxy-3'-hydroxynucleotide from a reaction mixture, which renders the
second strand ligatable to the target polynucleotide after treatment with a polynucleotide
kinase.

A preferred embodiment employing cycles of ligation and cleavage comprises the
following steps: (a) ligating (220) an encoded adaptor to an end of the polynucleotide (222),
the end of the polynucleotide having a dephosphorylated 5' hydroxyl, the end of the double
stranded adaptor to be ligated (224) having a first strand (226) and a second strand (228), the
second strand of the double stranded adaptor having a 3' blocking group (230), and the double
stranded adaptor having a nuclease recognition site (250) of a nuclease whose recognition site
is separate from its cleavage site; (b) removing the 3' blocking group after ligation, e.g. by
washing off the second strand (232); (c) phosphorylating (234) the 5' hydroxyl of the
polynucleotide; (d) ligating (236) a second strand (242) having an unblocked 3' moiety to
regenerate the double stranded adaptor (238) and the nuclease recognition site (250); (e)
identifying (244) one or more nucleotides at the end of the polynucleotide by the identity of
the adaptor ligated thereto; (f) cleaving (252) the polynucleotide with a nuclease that
recognizes the recognition site, such that the polynucleotide is shortened by one or more
nucleotides, the recognition site being positioned in the illustrated adaptor (224) so that the




cleavage (254) removes two nucleotides from polynucleotide (222); (g) dephosphorylating
(256) the 5' end of the polynucleotide; and (h) repeating (258) steps (a) through (g).

Typically, prior to ligation, ends of polynucleotides to be analyzed are prepared by
digesting them with one or more restriction endonucleases that produce predetermined
cleavages, usually having 3' or 5' protruding strands, i.e. "sticky" ends. Such digestions
usually leave the 5' strands phosphorylated. Preferably, these 5' phosphorylated ends are
dephosphorylated by treatment with a phosphatase, such as calf intestinal alkaline
phosphatase, or like enzyme, using standard protocols, e.g. as described in Sambrook et al,
Molecular Cloning, Second Edition (Cold Spring Harbor Laboratory, New York, 1989). By
removal of the 5' phosphates the target polynucleotides are rendered incapable of being ligated
in the presence of a ligase. The step of dephosphorylating preferably leaves a free 5'
hydroxyl.

Preferred Nucleases

"Nuclease" as the term is used in accordance with the invention means any enzyme,
combination of enzymes, or other chemical reagents, or combinations of chemical reagents and
enzymes that when applied to a ligated complex, discussed more fully below, cleaves the
ligated complex to produce an augmented adaptor and a shortened target polynucleotide. A
nuclease of the invention need not be a single protein, or consist solely of a combination of
proteins. A key feature of the nuclease, or of the combination of reagents employed as a
nuclease, is that its (their) cleavage site be separate from its (their) recognition site. The
distance between the recognition site of a nuclease and its cleavage site will be referred to
herein as its "reach." By convention, "reach" is defined by two integers which give the
number of nucleotides between the recognition site and the hydrolyzed phosphodiester bonds
of each strand. For example, the recognition and cleavage properties of Fok I are typically
represented as "GGATG(9/13)" because it recognizes and cuts a double stranded DNA as
follows (SEQ ID NO: 2):

5'- ... NNGGATGNNNNNNNNN     NNNNNNNNNN ...
3'- ... NNCCTACNNNNNNNNNNNNN     NNNNNN ...

where the bolded nucleotides are Fok I's recognition site and the N's are arbitrary nucleotides
and their complements.
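
The "reach" convention can be illustrated with a short sketch. It is a simplified model under stated assumptions: only the first occurrence of the recognition site on the top strand is considered, the site on the opposite strand is not searched, and the function name `cut_with_reach` and the example duplex are hypothetical. The two integers are counted from the last nucleotide of the recognition site, as in "GGATG(9/13)".

```python
def cut_with_reach(top, site="GGATG", reach=(9, 13)):
    """Locate the cut positions for a nuclease with the given site and reach.

    `top` is the top strand written 5'->3'. `reach` gives the number of
    nucleotides between the end of the recognition site and the cleaved bond
    on the top strand and on the bottom strand, respectively.
    """
    start = top.find(site)
    if start < 0:
        raise ValueError("recognition site not found on the top strand")
    site_end = start + len(site)
    top_cut = site_end + reach[0]
    bottom_cut = site_end + reach[1]
    if max(top_cut, bottom_cut) > len(top):
        raise ValueError("duplex is too short for this reach")
    low, high = sorted((top_cut, bottom_cut))
    return {
        "top_cut": top_cut,        # index in `top` where the top strand is cut
        "bottom_cut": bottom_cut,  # same coordinate system, bottom strand
        # Single stranded region between the two cuts, given as the
        # top-strand sequence.
        "overhang": top[low:high],
        "overhang_length": high - low,
    }


# Hypothetical duplex: 2 nt, the Fok I site, 9 nt, a 4 nt overhang, 6 nt.
duplex_top = "TT" + "GGATG" + "ACGTACGTA" + "CCGG" + "TTTTTT"
result = cut_with_reach(duplex_top)
print(result)
```

For a reach of (9/13) the two cuts are staggered by four nucleotides, so the shortened target polynucleotide is left with a four-nucleotide protruding strand available for the next ligation.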






It is important that the nuclease only cleave the target polynucleotide after it forms a 
complex with its recognition site; and preferably, the nuclease leaves a protruding strand on 
the target polynucleotide after cleavage. 

Preferably, nucleases employed in the invention are natural protein endonucleases (i) 
whose recognition site is separate from its cleavage site and (ii) whose cleavage results in a 
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prouudmg strand on the urge, polj^etoi*. Most prefcr,^ clMS „ 5 ^ 

« employed as „ invenUon , „ 

W, ,00: ,3.26 ( ,99, k Ro^e,,,, NllclejcAcids 

Livakand Bre^er. U.S. patent 5,093.*, Exemplary class „ s ^ " ^T 
— A'» X,, Bsm Al, Bbv ,, Bsm F1 . SK , H8a ,, Bsc ... Bbv , "J* * 

H h 1, Mbo , , Mme ,, R, e AI, Sap ,, Sfa N1 . Ta0 II, 1th 1 1 „,, Be„ 5I , Bpu M f ^ 

i £ : itt ^ *-* '■ « * £ - 

| MaNI. Bbv I is the most preferred nuclease. 

10 Preferably, prior to nuclease cleav^e stops, usually a, Ac sum of a seouenci,* 

IT rT'™ 5 " * '** *• *» - cTeTagc 
^offtenuceascbeinserap^ed. ™ S prevents undesired Ceavagc of d,e BW 
P^vnuclect,de because of the {omnms of ^ • 

km., the targe, polynucleotide. Blocking can be aehteved in a variety of ' 
»c, Ration and ^ byseq _ c . wlfic ^ ^ 

Wleot^es that form triplexes. m w ^ ^ « 
-Bp*. «. can be conven.en.ly bfcexed by meting the target p^^^T 
j the eogna* m^ylase ««. nuefcase ben* used. T^t is, fo, most j, ^Teria, 

:r; ^ of pa„,cu„, y N ,v E „ ela „d B r 

» PCR step „ employed in preparing target polynucleotides fo r scouencina 5 m e,hv, 7 
-*M ^ m thc ^ ^ ^ ^ P ed b, 

ford, abT'T^' "* Sk "' " "* C °" ld C< "" b ™ ^ ° f »* «*«•»««> sc, 

Gently. k„ S of the nrvcnuon mclude encoded adaptors, Oeavage adaptors and Ubc JT 
Kits ^ the nudease reagena, * liga.,0. T 

employing natural orcein crctacleases and ligases, l,gase buffers and nuclease buffi m 
be inc uded. In some ca^c u cc nuclease Duffers may 

m «h . , S ^ bC ident,Ca1 ' Such kits ™y ^so mclude a 

mcthylase and its reaction buffer Prefi-^Ki., u . , . «»wiuaea 

ourtcr. Preferably, kits also include one or more solid phase 
supports, e.g. microparticles carrying tag complements for sorting and anchoring target
polynucleotides.

Attaching Tags to Polynucleotides
For Sorting onto Solid Phase Supports
An important aspect of the invention is the sorting and attachment of populations of
polynucleotides, e.g. from a cDNA library, to microparticles or to separate regions on a solid
phase support such that each microparticle or region has substantially only one kind of
polynucleotide attached. This objective is accomplished by ensuring that substantially all
different polynucleotides have different tags attached. This condition, in turn, is brought about
by taking a sample of the full ensemble of tag-polynucleotide conjugates for analysis. (It is
acceptable that identical polynucleotides have different tags, as it merely results in the same
polynucleotide being operated on or analyzed twice in two different locations.) Such sampling
can be carried out overtly (for example, by taking a small volume from a larger
mixture) after the tags have been attached to the polynucleotides; it can be carried out
inherently as a secondary effect of the techniques used to process the polynucleotides and
tags; or sampling can be carried out both overtly and as an inherent part of processing steps.

Preferably, in constructing a cDNA library where substantially all different cDNAs
have different tags, a tag repertoire is employed whose complexity, or number of distinct tags,
greatly exceeds the total number of mRNAs extracted from a cell or tissue sample.
Preferably, the complexity of the tag repertoire is at least 10 times that of the polynucleotide
population; and more preferably, the complexity of the tag repertoire is at least 100 times that
of the polynucleotide population. Below, a protocol is disclosed for cDNA library
construction using a primer mixture that contains a full repertoire of exemplary 9-word tags.
Such a mixture of tag-containing primers has a complexity of 8^9, or about 1.34 x 10^8. As
indicated by Winslow et al, Nucleic Acids Research, 19: 3251-3253 (1991), mRNA for
library construction can be extracted from as few as 10-100 mammalian cells. Since a single
mammalian cell contains about 5 x 10^5 copies of mRNA molecules of about 3.4 x 10^4
different kinds, by standard techniques one can isolate the mRNA from about 100 cells, or
(theoretically) about 5 x 10^7 mRNA molecules. Comparing this number to the complexity of
the primer mixture shows that without any additional steps, and even assuming that mRNAs
are converted into cDNAs with perfect efficiency (1% efficiency or less is more accurate), the
cDNA library construction protocol results in a population containing no more than 37% of
the total number of different tags. That is, without any overt sampling step at all, the protocol
inherently generates a sample that comprises 37%, or less, of the tag repertoire. The
probability of obtaining a double under these conditions is about 5%, which is within the
preferred range. With mRNA from 10 cells, the fraction of the tag repertoire sampled is
reduced to only 3.7%, even assuming that all the processing steps take place at 100%
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efficacy. ,„ faa to effic^ of Ae ^ 

5 polyn.cta.des ,„ ^ whm fc ^ rf ^ « «*«• <* 

pann* of ^ ^ ^ of nRNA „ ^ „ 

****** ^^of^^T^ 

If mRNA were extracted from 1 0 6 cells f whirh „,„ u 
PO,(A) + RNA), and lf pnmcrs were ^^^^ * "~ ° 5 « " 
caHcd for in . ^ caI protoco]> , g . J't^^TT'T " 

Pjcrat 1 mg^uals about ,, 8x .0^ olcs) , thcn ^ ctotalnu ;^ f ^ L "~ 
polynucleotide conjugates in a cDNA library would simply be equal to or less than the
starting number of mRNAs, or about 5 x 10^11 vectors containing tag-polynucleotide
conjugates (again this assumes that each step in cDNA construction, namely first strand
synthesis, second strand synthesis, and ligation into a vector, occurs with perfect efficiency),
which is a very conservative estimate. The actual number is significantly less.

If a sample of n tag-polynucleotide conjugates is randomly drawn from a reaction
mixture, as could be effected by taking a sample volume, the probability of drawing
conjugates having the same tag is described by the Poisson distribution,
P(r) = e^(-λ) λ^r / r!, where r is the number of conjugates having the same tag and λ = np,
where p is the probability of a given tag being selected. If n = 10^6 and
p = 1/(1.34 x 10^8), then λ = .00746 and P(2) = 2.76 x 10^-5.

- 5 x 10H J^££^ — s: Assume that 

conjugates as inserts and tL the x 0 » ^ 

of 100m. Four ,0-fold serial dl t ^ ^ ^ a rcact ' on ^ion having a volume 

t0 ' d SCnal d,lut,ons ma >' ^ earned out by transferrin* 10 ul from a 
or.g,nal solution into a vessel containing 90 ul nf •» ° 

toniaimng yo ul ofan appropriate buffer such as TF n, 

Sxlo'TOorinoIralespcrul A2ul r * 2 l001 " ■<*«■ fining 

— « tag-cDNA eo^ " LI "TIT T™ ^ '° 

— — Of a c Ho* * ^ ^ ^ ^ 

Of course, as mentioned above, no step in the above process proceeds with perfect
efficiency. In particular, when vectors are employed to amplify a sample of tag-
polynucleotide conjugates, the step of transforming a host is very inefficient. Usually, no
more than 1% of the vectors are taken up by the host and replicated. Thus, for such a method
of amplification, even fewer dilutions would be required to obtain a sample of 10^6 conjugates.
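
The sampling arithmetic of this section can be reproduced numerically. The sketch below assumes the Poisson model given in the text, with λ equal to the number of sampled conjugates divided by the size of the tag repertoire (8^9 for the exemplary 9-word tags); the function names are illustrative and not part of the disclosure.

```python
import math

# Complexity of the exemplary repertoire: nine words, eight choices per word.
TAG_REPERTOIRE = 8 ** 9  # 134,217,728, i.e. about 1.34 x 10^8


def poisson(r, lam):
    """Poisson probability of exactly r occurrences when the mean is lam."""
    return math.exp(-lam) * lam ** r / math.factorial(r)


def sampling_stats(n_conjugates, repertoire=TAG_REPERTOIRE):
    """Return (lam, P(2)) for a sample of n conjugates from the repertoire."""
    lam = n_conjugates / repertoire
    return lam, poisson(2, lam)


# mRNA from about 100 cells at about 5 x 10^5 molecules per cell.
lam_100_cells, double_100_cells = sampling_stats(100 * 5 * 10 ** 5)

# An overt sample of one million conjugates.
lam_million, double_million = sampling_stats(10 ** 6)

print(TAG_REPERTOIRE)
print(lam_100_cells, double_100_cells)
print(lam_million, double_million)
```

With about 5 x 10^7 conjugates the sampled fraction is roughly 37% and the probability of a double is close to 5%; with 10^6 conjugates λ is about .00746 and the probability of a double drops to a few parts in 10^5, in line with the figures stated above.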

A repertoire of oligonucleotide tags can be conjugated to a population of
polynucleotides in a number of ways, including direct enzymatic ligation, amplification, e.g.
via PCR, using primers containing the tag sequences, and the like. The initial ligating step
produces a very large population of tag-polynucleotide conjugates such that a single tag is
generally attached to many different polynucleotides. However, as noted above, by taking a
sufficiently small sample of the conjugates, the probability of obtaining "doubles," i.e. the
same tag on two different polynucleotides, can be made negligible. Generally, the larger the
sample the greater the probability of obtaining a double. Thus, a design trade-off exists
between selecting a large sample of tag-polynucleotide conjugates, which, for example,
ensures adequate coverage of a target polynucleotide in a shotgun sequencing operation or
adequate representation of a rapidly changing mRNA pool, and selecting a small sample
which ensures that a minimal number of doubles will be present. In most embodiments, the
presence of doubles merely adds an additional source of noise or, in the case of sequencing, a
minor complication in scanning and signal processing, as microparticles giving multiple
fluorescent signals can simply be ignored.

As used herein, the term "substantially all" in reference to attaching tags to
molecules, especially polynucleotides, is meant to reflect the statistical nature of the sampling
procedure employed to obtain a population of tag-molecule conjugates essentially free of
doubles. The meaning of substantially all in terms of actual percentages of tag-molecule
conjugates depends on how the tags are being employed. Preferably, for nucleic acid
sequencing, substantially all means that at least eighty percent of the polynucleotides have
unique tags attached. More preferably, it means that at least ninety percent of the
polynucleotides have unique tags attached. Still more preferably, it means that at least ninety-
five percent of the polynucleotides have unique tags attached. And, most preferably, it means
that at least ninety-nine percent of the polynucleotides have unique tags attached.

Preferably, when the population of polynucleotides consists of messenger RNA
(mRNA), oligonucleotide tags may be attached by reverse transcribing the mRNA with a set
of primers preferably containing complements of tag sequences. An exemplary set of such
primers could have the following sequence:

5'-mRNA-[A]n-3'
        [T]19GG[W,W,W,C]9ACCAGCTGATC-5'-biotin

where "[W,W,W,C]9" represents the sequence of an oligonucleotide tag of nine subunits of
four nucleotides each and "[W,W,W,C]" represents the subunit sequences listed above, i.e.
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el rCP r ,S T A 11,(5 UndCr,med SeqUCnCCS - -rict.cn 

endonuclease site that can be used to release the polynucleotide from attachment to a solid
phase support via the biotin, if one is employed. For the above primer, the complement
attached to a microparticle could have the form:

5'-[G,W,W,W]9TGG-linker-microparticle



After reverse transcription, the mRNA is removed, e.g. by RNase H digestion, and
the second strand of the cDNA is synthesized using, for example, a primer of the following
form (SEQ ID NO: 3):

5'-NRRGATCYNNN-3'

where N is any one of A, T, G, or C; R is a purine-containing nucleotide, and Y is a
pyrimidine-containing nucleotide.

*e «*, « dna * ch , sr 6 ,r:r ,,tt 
: v.. fe , ^ Bam HIMl sms Ate :; ~ n r 

exemplary conjugate would have the form: ".gestion, the 

5'-RCGACCA[C,W,W,W]9GG[T]19- cDNA -NNNR
     GGT[G,W,W,W]9CC[A]19- rDNA -NNNYCTAG-5'

The polynucleot.de.tag conjugates may then be manipu.ated us.ng standard molecular bio. 
teehmcues. For examp.c, the above conjugate-which is actua.ly a mixture m b f 
- commercial, av,,ab,e Oonmg vectors, e g Stratagene a£^£X 
-sfected ,„to a host, such as a commerca,, av„ab,e host bacter., I ■ ^ 

it:: t number of conjusates The c,oning vcc - - » X 

Ho t H ^ 6 8 Sambr °° k " * MO,£CUlar Se - d **■ (Cold L g 

Ha b U b ewYork> 19g9) AI^ tiw ^^^^« 

be employed so that the conjugate population can be increased by PCR.

Preferably, when the ligase-based method of sequencing is employed, the Bst Y1 and
Sal I digested fragments are cloned into a Bam HI-/Xho I-digested vector having the following
single-copy restriction sites:



5 ' -GA^CCTTTATGGATCCACTCGAGATCCCAATCCA-3 



FokI BamHI Xhol 
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This adds the Fok I site which will allow initiation of the sequencing process discussed more 
fully below. 

Tags can be conjugated to cDNAs of existing libraries by standard cloning methods.
cDNAs are excised from their existing vector, isolated, and then ligated into a vector
containing a repertoire of tags. Preferably, the tag-containing vector is linearized by cleaving
with two restriction enzymes so that the excised cDNAs can be ligated in a predetermined
orientation. The concentration of the linearized tag-containing vector is in substantial excess
over that of the cDNA inserts so that ligation provides an inherent sampling of tags.

A general method for exposing the single stranded tag after amplification involves
digesting a target polynucleotide-containing conjugate with the 5'->3' exonuclease activity of
T4 DNA polymerase, or a like enzyme. When used in the presence of a single
deoxynucleoside triphosphate, such a polymerase will cleave nucleotides from 3' recessed
ends present on the non-template strand of a double stranded fragment until a complement of
the single deoxynucleoside triphosphate is reached on the template strand. When such a
nucleotide is reached the 5'->3' digestion effectively ceases, as the polymerase's extension
activity adds nucleotides at a higher rate than the excision activity removes nucleotides.
Consequently, single stranded tags constructed with three nucleotides are readily prepared for
loading onto solid phase supports.
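
The stopping rule of this "stripping" reaction can be sketched as a simple string operation. This is an idealized model and not a kinetic simulation: it assumes that digestion removes nucleotides one at a time from the 3' end of the digested strand and halts completely at the first nucleotide that matches the single triphosphate supplied. The function name and the example sequence are hypothetical.

```python
def strip_3prime(strand, dntp):
    """Return what remains of `strand` (written 5'->3') after stripping.

    Nucleotides are removed from the 3' end until the 3'-terminal base is the
    same as the single deoxynucleoside triphosphate present (`dntp`), at which
    point excision and re-addition balance and net digestion stops.
    """
    end = len(strand)
    while end > 0 and strand[end - 1] != dntp:
        end -= 1
    return strand[:end]


# Hypothetical digested strand: a flanking region containing C, followed by a
# region built only from A, G and T (as for tags made of three nucleotides).
digested_strand = "ACGGC" + "TTAGATGA"
remaining = strip_3prime(digested_strand, "C")
exposed_length = len(digested_strand) - len(remaining)
print(remaining, exposed_length)
```

Because the region built from three nucleotides contains no C in this example, digestion in the presence of dCTP runs through the whole of that region and stops at the first flanking C, leaving the complementary strand single stranded over that length.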

The technique may also be used to preferentially methylate interior Fok I sites of a
target polynucleotide while leaving a single Fok I site at the terminus of the polynucleotide
unmethylated. First, the terminal Fok I site is rendered single stranded using a polymerase
with deoxycytidine triphosphate. The double stranded portion of the fragment is then
methylated, after which the single stranded terminus is filled in with a DNA polymerase in the
presence of all four nucleoside triphosphates, thereby regenerating the Fok I site. Clearly, this
procedure can be generalized to endonucleases other than Fok I.

After the oligonucleotide tags are prepared for specific hybridization, e.g. by
rendering them single stranded as described above, the polynucleotides are mixed with
microparticles containing the complementary sequences of the tags under conditions that favor
the formation of perfectly matched duplexes between the tags and their complements. There
is extensive guidance in the literature for creating these conditions. Exemplary references
providing such guidance include Wetmur, Critical Reviews in Biochemistry and Molecular
Biology, 26: 227-259 (1991); Sambrook et al, Molecular Cloning: A Laboratory Manual, 2nd
Edition (Cold Spring Harbor Laboratory, New York, 1989); and the like. Preferably, the
hybridization conditions are sufficiently stringent so that only perfectly matched sequences
form stable duplexes. Under such conditions the polynucleotides specifically hybridized
through their tags may be ligated to the complementary sequences attached to the
microparticles. Finally, the microparticles are washed to remove polynucleotides with
unligated and/or mismatched tags.
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When CPG microparticles conventionally employed as svmhncJc 

-essaty for so™ ^encng opera,™,. U,, is, in se^app^I ^* *" 
-ceessi™ treat™, rffc amched ^ , ^ » *» ■»« 

rt-Mi n» y ,end to « it ^ ^ ^JZZZZ 

IOOJ.«^, <wltepl-JM[teidc , i Tl,is ensures tta ,hc dercitv oL, , " 
on » - ™ surTaee will not be so high as to «,„ CrL^Sn" 
>»ragc ^-po^eo,*. spacrng on a. micropartidc surfacc „ ££££ 
- Gu,dancc ,„ select ^ f „ sffirfatd cpG suppom ^ * 

»M(1TO2). rVcferably, for Renting applications, standard CPG be«i, „f*. . 
»WC „f20-50 Mm arc loaded wid, about 10 5 po,^^ "» 
(GMA) beads avaifcbK fa. B„gs U^^^^^*-*- 
-0 arc loaded * a ». tens o f tbonsa. " 

* -c b capacity, the tolerance for multiple cod.c* nfmin^ 
— « eomplen™ „,. -bead tables-,, and ,Hc 1 J ^T** * 
^ rogard.ng n^parUele _^ ^ ^ , 



■/A 



Microparticle
diameter               5 μm            10 μm          20 μm       40 μm

Approx. area of
monolayer of 10^6
microparticles         .45 x .45 cm    1 x 1 cm       2 x 2 cm    4 x 4 cm

Max. no.
polynucleotides
loaded at 1 per
10^5 sq. angstrom      3 x 10^5        1.26 x 10^6    5 x 10^6

The probability that the sample of microDaniele, m„t, 

30 ^^^zzzzzzz? 
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Table VI 



Number of            Fraction of          Fraction of          Fraction of
microparticles in    repertoire of tag    microparticles in    microparticles in
sample (as           complements          sample with          sample carrying
fraction of          present in           unique tag           same tag
repertoire size),    sample,              complement           complement as
m                    1-e^-m               attached,            one other
                                          m(e^-m)              microparticle in
                                                               sample
                                                               ("bead doubles"),
                                                               m^2(e^-m)/2

1.000                0.63                 0.37                 0.18
.693                 0.50                 0.35                 0.12
.405                 0.33                 0.27                 0.05
.285                 0.25                 0.21                 0.03
.223                 0.20                 0.18                 0.02
.105                 0.10                 0.09                 0.005
.010                 0.01                 0.01




Highsr**^ 



^^^^^ 

m increase » «*• — > * *- ° f 

ap (mowed in the hybridization reaction. As expla. 

ma, be ameliorated by -panning " sufficiently small 

Specifier, of ft. nytndiaauons may be b ^ 

sample »—* * «* * »£Z :H ^ Uis ,at«, 

".ay be nKt by utag P ^^^^ » 

conj-ga.es that . about ■ I P«" &om Tab , c ,,, . repc „o,,e of I». . 

example, .fogs are coastmeted »nh eign. ofug-cDNA 

ml nis and ugcoiruileirMs are produced. In a library oi b 
,bo». 1.67 ugs »d ug P ^ ^ |6 100 diffcien , ^ 

conjuga.es as described above, a 0.1 percent sa of panicles, or in 

„c present. If* - *?* ~ ' TTTj ,L .J*., sampled 

u «f i fV7 x 1 0 7 microparticles, then oni> a spar:*, => 
te example a samp e of 1 .67 * lOjnj^ ^ fec incrcase „ for 

nicies would be loaded. r mj , step in whic h the sampled 

exampl e, for more eff.oent ^J^^cUl- - 
tag-cDNA conjugates are used to separate toaa p 
IroparUclcs. Thus, m the example above, even thoueh a 0,1 percent P 
only 16,700 cDNAs, the sampling and panning steps may be repeated until as many loaded
microparticles as desired are accumulated. Alternatively, loaded microparticles may be
separated from unloaded microparticles by a fluorescently activated cell sorting (FACS)
instrument using conventional protocols, e.g. tag-cDNA conjugates may be fluorescently
labelled in the technique described below by providing a fluorescently labelled right primer.
After loading and FACS sorting, the label may be cleaved prior to ligating encoded adaptors,
e.g. by Dpn I or like enzymes that recognize methylated sites.

A panning step may be implemented by providing a sample of tag-cDNA conjugates
each of which contains a capture moiety at an end opposite, or distal to, the oligonucleotide
tag. Preferably, the capture moiety is of a type which can be released from the tag-cDNA
conjugates, so that the tag-cDNA conjugates can be sequenced with a single-base sequencing
method. Such moieties may comprise biotin, digoxigenin, or like ligands, a triplex binding
region, or the like. Preferably, such a capture moiety comprises a biotin component. Biotin
may be attached to tag-cDNA conjugates by a number of standard techniques. If appropriate
adapters containing PCR primer binding sites are attached to tag-cDNA conjugates, biotin
may be attached by using a biotinylated primer in an amplification after sampling.
Alternatively, if the tag-cDNA conjugates are inserts of cloning vectors, biotin may be
attached after excising the tag-cDNA conjugates by digestion with an appropriate restriction
enzyme followed by isolation and filling in a protruding strand distal to the tags with a DNA
polymerase in the presence of biotinylated uridine triphosphate.

After a tag-cDNA conjugate is captured, it may be released from the biotin moiety in
a number of ways, such as by a chemical linkage that is cleaved by reduction, e.g. Herman et
al, Anal. Biochem., 156: 48-55 (1986), or that is cleaved photochemically, e.g. Olejnik et al,
Nucleic Acids Research, 24: 361-366 (1996), or that is cleaved enzymatically by introducing
a restriction site in the PCR primer. The latter embodiment can be exemplified by considering
the library of tag-polynucleotide conjugates described above:



5'-RCGACCA[C,W,W,W]9GG[T]19- cDNA -NNNR
     GGT[G,W,W,W]9CC[A]19- rDNA -NNNYCTAG-5'

The following adapters may be ligated to the ends of these fragments to permit amplification
5'- XXXXXXXXXXXXXXXXXXXX
    XXXXXXXXXXXXXXXXXXXXYGAT
    Right Adapter

GATCZZACTAGTZZZZZZZZZZZZ-3'
    ZZTGATCAZZZZZZZZZZZZ
    Left Adapter

    ZZTGATCAZZZZZZZZZZZZ-5'-biotin
    Left Primer

where ACTAOT- ,s . S pe , recogmUon * (whl ch lcavcs , ^ 

*«. — of ftc restive primers « appro»,ma,c,v fe ^ 8 
hgaoon of*. adapKK and „ y pcR ^ ^ bio[iny|attd 

the conjngaics arc rendered !i%lc stranded by the cxonuclcasc acvitv of T4 DNA 
P^yrnerasc and conjugates are combined „,„, . ^ of roicro( , a „ ides , 

«, „ cerements ^ ^ ^ ^ P ; 

rn.s-a.aeh™,, of tags). ftc Mnju6atts prefmbly ^ ^ 

eapture wnh av.dma.ed magnetic beads, „, l, ke capture technique * 

k ?T """"^ ™* « iff *™ "ft <** may be reused from ft, 
ma»*,e fcads by eieavr.ge with Spc ,. By ^ „,„ 40 . 50 «" 

-Pies of microbes a* lag-cDNA eonjoga.es, „ , ,0* cDNAs can be ac^Tled 
by poohngftc leased rmcrepanieics. H.e pe*d m ,e,opa„ i e,es m ay ften be 
stmullaneously sequenced by a single-base sequencing ieehruone 

*— — « now .any cDNAs „ anaiyzc, depends „„ «* object ,f the 
b^ve ,s ,o mo«,„, the changes „ ,b u „da„ e 0 f „,„ive,y common s= uenccs e g mak™ 
£ U or more of a popuiauon. ften rC.vCy sma„ samp,*, ,,, sma,, fraction f ^ 

oT H 7 ^ ^ — <*«■*» aborts ^ 17 

-he- hand, ,f „„ seeks » r^niror the abuntW of rare se q „e„ees, e g making p 
- of a popuhatron. to ^ ^ - J-P 
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^p* s ,ze and ft. «, of ft. MM * *— ■ »— - *■ 

(„«); Good, Bromcmka, 40: ,6-264 (.953). Bunge « all. Am. S*. A»t. ». 364 

Llary. mL preferabiy. a samp* of a, M I0» - 

^Liared for ft. of- m. «— W*. *• ° f 

prefer* .0 — » rc,„ve abundance of a 

population size. 

Construction of a Tag Library

An exemplary tag library is constructed as follows to form the chemically synthesized
9-word tags of nucleotides A, G, and T defined by the formula:

3'-TGGC-[4(A,G,T)9]-CCCCp

where "[4(A,G,T)9]" indicates a tag mixture where each tag consists of nine 4-mer words of
A, G, and T; and "p" indicates a 5' phosphate. This mixture is ligated to the following right
and left primer binding regions (SEQ ID NO: 4 & 5):

rr.^rrr*rcc 5'- GGGGCCCAGTCAGCGTCGAT 

25 5'- AGTGGCTGGGCATCGGACCG GGGTCAGTCGCAGCTA 

TCACCGACCCGTAGCCp 

RIGHT 

LEFT 

The right and left primer binding regions are ligated to the above tag mixture, after which the
single stranded portion of the ligated structure is filled with DNA polymerase, then mixed with
the right and left primers indicated below and amplified to give a tag library.
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40 



Left Primer
5'-AGTGGCTGGGCATCGGACCG

CCCCGGGTCAGTCGCAGCTA-5'
Right Primer

The underlined portion of the left primer binding region indicates a Rsr II recognition site.
The left-most underlined region of the right primer binding region indicates recognition sites
for Bsp 120I, Apa I, and Eco O109I, and a cleavage site for Hga I. The right-most
underlined region of the right primer binding region indicates the recognition site for Hga I.
Optionally, the right or left primers may be synthesized with a biotin attached (using
conventional reagents, e.g. available from Clontech Laboratories, Palo Alto, CA) to facilitate
purification after amplification and/or cleavage.

Construction of a Plasmid Library of Tag-Polynucleotide
Conjugates for cDNA "Signature" Sequencing Using
Encoded Adaptors

cDNA is produced from an mRNA sample by conventional protocols using
pGGCCCT15(A or G or C) as a primer for first strand synthesis anchored at the boundary of
the poly A region of the mRNAs and N8(A or T)GATC as the primer for second strand
synthesis. That is, both are degenerate primers such that the second strand primer is present
in two forms and the first strand primer is present in three forms. The GATC sequence in the
second strand primer corresponds to the recognition site of Mbo I; other four base recognition
sites could be used as well, such as those for Bam HI, Sph I, Eco RI, or the like. The
presence of the A and T adjacent to the restriction site of the second strand primer ensures
that a stripping and exchange reaction can be used in the next step to generate a five-base 5'
overhang of "GGCCC". The first strand primer is annealed to the mRNA sample and
extended with reverse transcriptase, after which the RNA strand is degraded by the RNase H
activity of the reverse transcriptase, leaving a single stranded cDNA. The second strand
primer is annealed and extended with a DNA polymerase using conventional protocols. After
second strand synthesis, the resulting cDNAs are methylated with CpG methylase (New
England Biolabs, Beverly, MA) using manufacturer's protocols. The 3' strands of the cDNAs
are then cut back with the above-mentioned stripping and exchange reaction using T4 DNA
polymerase in the presence of dATP and dTTP, after which the cDNAs are ligated to the tag
library described above, previously cleaved with Hga I, to give the following construct:






5'-biotin-| primer binding site |  TAG  |  Apa I  |  cDNA  |
                   ^                                        ^
              Rsr II site                              Mbo I site



Separately, the following cloning vector is constructed, e.g. starting from a commercially
available plasmid, such as a Bluescript phagemid (Stratagene, La Jolla, CA) (SEQ ID NO: 6):



                primer binding site            Ppu MI site
                                                    |
(plasmid)-5'-AAAAGGAGGAGGCCTTGATAGAGAGGACCT GTTTAAAC-
            -TTTTCCTCCTCCGGAACTATCTCTCCTGGA CAAATTTG-

                primer binding site

-GTTTAAAC-GGATCC-TCTTCCTCTTCCTCTTCC-3'-(plasmid)
-CAAATTTG-CCTAGG-AGAAGGAGAAGGAGAAGG-
     ^       ^
     |       Bam HI site
     Pme I site

The plasmid is cleaved with Ppu MI and Pme I (to give a Rsr II-compatible end and a flush
end so that the insert is oriented) and then methylated with DAM methylase. The tag-
containing construct is cleaved with Rsr II and then ligated to the open plasmid, after which
the conjugate is cleaved with Mbo I and Bam HI to permit ligation and closing of the plasmid.
The plasmids are then amplified and isolated for use in accordance with the invention.
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Example 1 

Sequencing a Target Polynucleotide
Amplified from pGEM7Z: Identification of
Nucleotides by Cycles of Ligation and Cleavage
In this example, a segment of plasmid pGEM7Z (Promega, Madison, WI) is
amplified and attached to glass beads via a double stranded DNA linker, one strand of which
is synthesized directly onto (and therefore covalently linked to) the beads. After the ends of
the target polynucleotide are prepared for ligation to the encoded adaptors, in each cycle of
ligation and cleavage, a mixture of encoded adaptors (1024 different adaptors in all) is
applied to the target polynucleotides so that only those adaptors whose protruding strands
form perfectly matched duplexes with the target polynucleotides are ligated. Each of 16
fluorescently labelled tag complements is then applied to the polynucleotide-adaptor
conjugates under conditions that permit hybridization of only the correct tag complements.
The presence or absence of fluorescent signal after washing indicates the presence or absence
of a particular nucleotide at a particular location. The sequencing protocol of this example is
applicable to multiple target polynucleotides sorted onto one or more solid phase supports as
described in Brenner, International patent applications PCT/US95/12791 and
PCT/US96/09513 (WO 96/12041 and WO 96/41011).
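
The figure of 1024 adaptors follows from the layout of the sets: one nucleotide is fixed at one of the four positions of the protruding strand and the remaining three positions are degenerate, giving 16 sets of 64 sequences. The following is a minimal Python sketch of that count, assuming a four-nucleotide protruding strand; the function name `adaptor_sets` is hypothetical and not part of the original disclosure.

```python
from itertools import product

BASES = "ACGT"

def adaptor_sets(overhang_len=4):
    """Map (position, fixed base) -> all protruding-strand sequences in that set.

    Assumption: each set fixes one base at one position and lets the
    other positions range over all four nucleotides (the "N" positions).
    """
    sets = {}
    for pos in range(overhang_len):
        for base in BASES:
            seqs = []
            for rest in product(BASES, repeat=overhang_len - 1):
                seq = list(rest)
                seq.insert(pos, base)  # place the fixed base at its position
                seqs.append("".join(seq))
            sets[(pos, base)] = seqs
    return sets

sets = adaptor_sets()
total = sum(len(seqs) for seqs in sets.values())
print(len(sets), "sets,", total, "adaptors in all")
```

Each (position, base) key corresponds to one oligonucleotide tag, so a positive signal from that tag's complement reports that base at that position.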

A 47-mer oligonucleotide is synthesized directly on Ballotini beads (0.040-0.075 mm,
Jencons Scientific, Bridgeville, PA) using a standard automated DNA synthesizer protocol.
The complementary strand to the 47-mer is synthesized separately and purified by HPLC.
When hybridized, the resulting duplex has a Bst XI restriction site at the end distal from the
bead. The complementary strand is hybridized to the attached 47-mer in the following
mixture: 25 uL complementary strand at 200 pmol/uL; 20 mg Ballotini beads with the
47-mer; 6 uL New England Biolabs #3 restriction buffer (from 10x stock solution); and 25 uL
distilled water. The mixture is heated to 93°C and then slowly cooled to 55°C, after which
40 units of Bst XI (at 10 units/uL) is added to bring the reaction volume to 60 uL. The
mixture is incubated at 55°C for 2 hours, after which the beads are washed three times in TE
(pH 8.0).

The segment of pGEM7Z to be attached to the beads is prepared as follows: Two
PCR primers were prepared using standard protocols (SEQ ID NO: 7 and SEQ ID NO: 8):

Primer 1: 5'-CTAAACCATTGGTATGGGCCAGTGAATTGTAATA

Primer 2: 5'-CGCGCAGCCCGCATCGTTTATGCTACAGACTGTC-
             AGTGCAGCTCTCCGATCCAAA
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T* PGR reaction mixture consists of the foUowing: . pi P GEM7Z at 1 ng/pl; 10 pi pnmer 1 
r ^W O - PH- ' at 10 pntoM; 10 ,1 deoxynbonucleotide triphosphates at 2. 

M 0 ^ 10. PGR buffer (Perlcm-Elmer); 0.5 u. Taq DNA polymerase at , un,ts/ul; and 
mM; 10 til 10x PCK o i . . mcof ,oo W l The reaction mixture was subjected to 
58 ul distilled water to give a final volume of 100 pi. ^ 
Z If 930C for 30 sec; 60<>C for 15 sec; and 72<>C for 60 sec, to g.ve a 172 basepair 
25 cycles of 93 C for sec pcR mixturC) 12 ul lOx « 

i Kpw Encland Biolabs butler, 5 bdv i ai i ui»u r 

B» X. (^Bbv . ~d. *» is added: 5 id 1 M W 67 «** «a,», and , 

, Cli *~ -cede, *m <-* • Cenmcon 30 (Aniicon, W 

' t . w« „™.«ol the Bb» I/Bst Xl-restnctcd fragment is lifted <o 

c I—,! niAlnhs referred to be ow as NEB), 3 F l,ul,n " b 

England Biolabs, reterre ^ ^ ^ 

5 distilled water, which mixture is incubated at 25 C for 4 nours, a 

I^tim^ 

9) for sequencing having a 5' phosphate: 



          AGCTACCCGATC
[BEAD]--- TCGATGGGCTAGATTTp-

The 5' phosphate is removed by treating the bead mixture with an alkaline phosphatase, e.g.
available from New England Biolabs (Beverly, MA), using manufacturer's protocol.

The top strands of the following 16 sets of 64 encoded adaptors (SEQ ID NO: 10
through SEQ ID NO: 25) are each separately synthesized on an automated DNA synthesizer
(Applied Biosystems, Foster City) using standard methods. The bottom strand,
which is the same for all adaptors, is synthesized separately then hybridized to the respective
top strands:

SEQ ID
NO.    Encoded Adaptor

10     5'-pANNNTACAGCTGCATCCCttggcgctgagg
              pATGCACGCGTAGGG-5'

11     5'-pNANNTACAGCTGCATCCCtgggcctgtaag
              pATGCACGCGTAGGG-5'

12     5'-pCNNNTACAGCTGCATCCCttgacgggtctc
              pATGCACGCGTAGGG-5'

13     5'-pNCNNTACAGCTGCATCCCtgcccgcacagt
              pATGCACGCGTAGGG-5'

14     5'-pGNNNTACAGCTGCATCCCttcgcctcggac
              pATGCACGCGTAGGG-5'

15     5'-pNGNNTACAGCTGCATCCCtgatccgctagc
              pATGCACGCGTAGGG-5'

16     5'-pTNNNTACAGCTGCATCCCttccgaacccgc
              pATGCACGCGTAGGG-5'

17     5'-pNTNNTACAGCTGCATCCCtgagggggatag
              pATGCACGCGTAGGG-5'

18     5'-pNNANTACAGCTGCATCCCttcccgctacac
              pATGCACGCGTAGGG-5'

19     5'-pNNNATACAGCTGCATCCCtgactccccgag
              pATGCACGCGTAGGG-5'

20     5'-pNNCNTACAGCTGCATCCCtgtgttgcgcgg
              pATGCACGCGTAGGG-5'

21     5'-pNNNCTACAGCTGCATCCCtctacagcagcg
              pATGCACGCGTAGGG-5'

22     5'-pNNGNTACAGCTGCATCCCtgtcgcgtcgtt
              pATGCACGCGTAGGG-5'

23     5'-pNNNGTACAGCTGCATCCCtcggagcaacct
              pATGCACGCGTAGGG-5'

24     5'-pNNTNTACAGCTGCATCCCtggtgaccgtag
              pATGCACGCGTAGGG-5'

25     5'-pNNNTTACAGCTGCATCCCtcccctgtcgga
              pATGCACGCGTAGGG-5'

where N and p are as defined above, and the nucleotides indicated in lower case letters are the
12-mer oligonucleotide tags. Each tag differs from every other by 6 nucleotides. Equal molar
quantities of each adaptor are combined in NEB buffer No. 2 (New England Biosciences,
Beverly, MA) to form a mixture at a concentration of 1000 pmol/uL.
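
The stated separation between tags can be checked directly by counting mismatched positions between every pair of the 16 lower-case tag sequences of SEQ ID NO: 10 through 25. The sketch below is illustrative only; the helper name `hamming` and the dictionary layout are assumptions made for this example.

```python
from itertools import combinations

# 12-mer tags keyed by SEQ ID NO, copied from the lower-case portions
# of the encoded adaptors of this example.
TAGS = {
    10: "ttggcgctgagg", 11: "tgggcctgtaag", 12: "ttgacgggtctc",
    13: "tgcccgcacagt", 14: "ttcgcctcggac", 15: "tgatccgctagc",
    16: "ttccgaacccgc", 17: "tgagggggatag", 18: "ttcccgctacac",
    19: "tgactccccgag", 20: "tgtgttgcgcgg", 21: "tctacagcagcg",
    22: "tgtcgcgtcgtt", 23: "tcggagcaacct", 24: "tggtgaccgtag",
    25: "tcccctgtcgga",
}

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Mismatch count for every unordered pair of tags.
pairwise = {
    (i, j): hamming(TAGS[i], TAGS[j])
    for i, j in combinations(sorted(TAGS), 2)
}
closest = min(pairwise, key=pairwise.get)
print("closest pair:", closest, "differing at", pairwise[closest], "positions")
```

The smallest value in `pairwise` is the figure to compare against the separation quoted in the text.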

Each of the 16 tag complements is separately synthesized as an amino-derivatized
oligonucleotide and is labelled with a fluorescein molecule (e.g. FAM, an NHS-ester
of fluorescein, available from Molecular Probes, Eugene, OR) which is attached to the 5' end
of the tag complement through a polyethylene glycol linker (Clontech Laboratories, Palo
Alto, CA). The sequences of the tag complements are simply the 12-mer complements of the
tags listed above.
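
Because each tag complement is simply the complement of a 12-mer tag, its sequence can be derived mechanically. The following sketch assumes that the complement is written 5' to 3', i.e. as the reverse complement of the tag as listed; the function name `tag_complement` is hypothetical.

```python
# Watson-Crick pairing for lower-case DNA letters.
COMPLEMENT = str.maketrans("acgt", "tgca")

def tag_complement(tag):
    """Return the reverse complement of a tag, written 5' to 3'."""
    return tag.translate(COMPLEMENT)[::-1]

tag = "ttggcgctgagg"  # tag of SEQ ID NO: 10
print(tag, "->", tag_complement(tag))
```

Applying the function twice returns the original tag, which is a convenient sanity check when preparing synthesis orders.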
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Ligation of the adaptors to the target polynucleotide is carried out in a mixture 
consisting of 5 uL beads (20 mg), 3 uL NEB 10x ligase buffer, 5 uL adaptor mix (25 nM),
2.5 uL NEB T4 DNA ligase (2000 units/uL), and 14.5 uL distilled water. The mixture is
incubated at 16°C for 30 minutes, after which the beads are washed 3 times in TE (pH 8.0).
After centrifugation and removal of TE, the 3' phosphates of the ligated adaptors are
removed by treating the polynucleotide-bead mixture with calf intestinal alkaline phosphatase
(CIP) (New England Biolabs, Beverly, MA), using the manufacturer's protocol. After
removal of the 3' phosphates, the CIP may be inactivated by proteolytic digestion, e.g. using
Pronase™ (available from Boehringer Mannheim, Indianapolis, IN), or an equivalent
protease, with the manufacturer's protocol. The polynucleotide-bead mixture is then washed
and treated with a mixture of T4 polynucleotide kinase and T4 DNA ligase (New England
Biolabs, Beverly, MA) to add a 5' phosphate at the gap between the target polynucleotide and
the adaptor and to complete the ligation of the adaptors to the target polynucleotide. The
bead-polynucleotide mixture is then washed in TE.

Separately, each of the labelled tag complements is applied to the polynucleotide-bead
mixture under conditions which permit the formation of perfectly matched duplexes only
between the oligonucleotide tags and their respective complements, after which the mixture is
washed under stringent conditions, and the presence or absence of a fluorescent signal is
measured. Tag complements are applied in a solution consisting of 25 nM tag complement,
50 mM NaCl, 3 mM Mg, 10 mM Tris-HCl (pH 8.5), at 20°C, incubated for 10 minutes, then
washed in the same solution (without tag complement) for 10 minutes at 55°C.

After the four nucleotides are identified as described above, the encoded adaptors are
cleaved from the polynucleotides with Bbv I using the manufacturer's protocol. After an
initial ligation and identification, the cycle of ligation, identification, and cleavage is repeated
three times to give the sequence of the 16 terminal nucleotides of the target polynucleotide.
Figure 4 illustrates the relative fluorescence from each of four tag complements applied to
identify nucleotides at positions 5 through 16 (from the most distal from the bead to the most
proximal to the bead).
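
The identification step described above amounts to comparing, for each position, the signals obtained from the four tag complements that report A, C, G, and T at that position. A minimal sketch of such a comparison is given below; the function name `call_bases` and the fluorescence values are invented for illustration and are not measurements from Figure 4.

```python
def call_bases(signals):
    """Pick, for each position, the base whose tag complement gave the
    strongest relative fluorescence.

    signals: {position: {base: relative fluorescence}}
    returns: {position: called base}
    """
    return {
        pos: max(by_base, key=by_base.get)
        for pos, by_base in signals.items()
    }

# Hypothetical relative fluorescence for one cycle (four positions).
example = {
    1: {"A": 0.05, "C": 0.90, "G": 0.08, "T": 0.04},
    2: {"A": 0.85, "C": 0.06, "G": 0.10, "T": 0.03},
    3: {"A": 0.07, "C": 0.04, "G": 0.12, "T": 0.95},
    4: {"A": 0.02, "C": 0.09, "G": 0.88, "T": 0.06},
}

calls = call_bases(example)
print("".join(calls[pos] for pos in sorted(calls)))
```

A practical implementation would also need a threshold for ambiguous positions where no signal clearly dominates; that is outside the scope of this sketch.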

•>q Example 2 

Construction and Sorting of cDNA Library for
Signature Sequencing with Encoded Adaptors
In this example, a cDNA library is constructed in which an oligonucleotide tag
consisting of 8 four-nucleotide "words" is attached to each cDNA. As described above, the
repertoire of oligonucleotide tags of this size is sufficiently large (about 10^8) so that if the
cDNAs are synthesized from a population of about 10^6 mRNAs, then there is a high
probability that each cDNA will have a unique tag for sorting. After mRNA extraction, first
strand synthesis is carried out in the presence of 5-Me-dCTP (to block certain cDNA
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restriction sites) and a biotinylatcd primer mixture containing the oligonucleotide tags. After 
conventional second strand synthesis, the tag-cDNA conjugates are cleaved with Dpn II
(which is unaffected by the 5-Me-deoxycytosines), the biotinylated portions are separated
from the reaction mixture using streptavidin-coated magnetic beads, and the tag-cDNA
conjugates are recovered by cleaving them from the magnetic beads via a Bsm BI site carried
by the biotinylated primer. The Bsm BI-Dpn II fragment containing the tag-cDNA conjugate
is then inserted into a plasmid and amplified. After isolation of the plasmids, tag-cDNA
conjugates are amplified out of the plasmids by PCR in the presence of 5-Me-dCTP, using
biotinylated and fluorescently labelled primers containing pre-defined restriction endonuclease
sites. After affinity purification with streptavidin-coated magnetic beads, the tag-cDNA
conjugates are cleaved from the beads, treated with T4 DNA polymerase in the presence of
dGTP to render the tags single stranded, and then combined with a repertoire of GMA beads
having tag complements attached. After stringent hybridization and ligation, the GMA beads
are sorted via FACS to produce an enriched population of GMA beads loaded with cDNAs.
The enriched population of loaded GMA beads is immobilized in a planar array in a flow
chamber where base-by-base sequencing takes place using encoded adaptors.
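
The statement that each cDNA is likely to carry a unique tag can be made concrete with a simple occupancy estimate. The sketch below assumes that tags are attached uniformly and independently at random, which is an idealization; the repertoire and population sizes (10^8 and 10^6) are used as illustrative parameters, and the function name `prob_unique_tag` is hypothetical.

```python
import math

def prob_unique_tag(repertoire, population):
    """Probability that a given molecule's tag is shared with no other
    molecule, when each of `population` molecules draws a tag uniformly
    at random from `repertoire` possibilities."""
    # (1 - 1/R)^(n-1), evaluated with log1p for numerical accuracy.
    return math.exp((population - 1) * math.log1p(-1.0 / repertoire))

p = prob_unique_tag(repertoire=10**8, population=10**6)
print(f"P(unique tag) is about {p:.4f}")
```

Under these assumptions roughly 99 of every 100 cDNAs carry a tag shared with no other cDNA; a smaller repertoire or a larger population lowers that fraction.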

Approximately 5 ug of poly(A+) mRNA is extracted from DBY746 yeast cells using
conventional protocols. First and second strand cDNA synthesis is carried out by combining
100-150 pmoles of the following primer (SEQ ID NO: 26):

5'-biotin-ACTAAT CGTCTC ACTAT TTAATTAA [W,W,W,G]8 CC(T)18V-3'

with the poly(A+) mRNA using a Stratagene (La Jolla, CA) cDNA Synthesis Kit in
accordance with the manufacturer's protocol. This results in cDNAs whose first strand
deoxycytosines are methylated at the 5-carbon position. In the above formula, "V" is G, C,
or A, "[W,W,W,G]" is a four-nucleotide word selected from Table II as described above, the
single underlined portion (CGTCTC) is a Bsm BI recognition site, and the double underlined
portion (TTAATTAA) is a Pac I recognition site. After size fractionation (GIBCO-BRL cDNA
Size Fractionation Kit) using conventional protocols, the cDNAs are digested with Dpn II
(New England Bioscience, Beverly, MA) using manufacturer's protocol and affinity purified
with streptavidin-coated magnetic beads (M-280 beads, Dynal A.S., Oslo, Norway). The DNA
captured by the beads is digested with Bsm BI to release the tag-cDNA conjugates for cloning
into a modified pBCSK- vector (Stratagene, La Jolla, CA) using standard protocols. The
pBCSK- vector is modified by adding a Bbs I site by inserting the following fragment
(SEQ ID NO: 27) into the Kpn I/Eco RV digested vector:

       CGAAGACCC
3'-CATGGCTTCTGGGGATA-5'
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— — a standard p.«- -H* *° - 
After isolating the above pBCSK- vector from a standard plasmid miniprep, the tag-
cDNA conjugates are amplified by PCR in the presence of 5-Me-dCTP using 20-mer primers
complementary to vector sequences flanking the tag-cDNA insert. The "upstream" primer,
i.e. adjacent to the tag, is biotinylated and the "downstream" primer, i.e. adjacent to the
cDNA, is labelled with fluorescein. After amplification, the PCR product is affinity purified
then cleaved with Pac I to release fluorescently labelled tag-cDNA conjugates. The tags of
the conjugates are rendered single stranded by treating them with T4 DNA polymerase in the
presence of dGTP. After the reaction is quenched, the tag-cDNA conjugate is purified by
phenol-chloroform extraction and combined with 5.5 mm GMA beads carrying tag
complements, each tag complement having a 5' phosphate. Hybridization is conducted under
stringent conditions in the presence of a thermal stable ligase so that only tags forming
perfectly matched duplexes with their complements are ligated. The GMA beads are washed
and the loaded beads are concentrated by FACS sorting, using the fluorescently labelled
cDNAs to identify loaded GMA beads. The tag-cDNA conjugates attached to the GMA
beads are digested with Dpn II to remove the fluorescent label and treated with alkaline
phosphatase to prepare the cDNAs for sequencing.

The following cleavage adaptor (SEQ ID NO: 28) is ligated to the Dpn II-digested
and phosphatase treated cDNAs:

5'-pGATCAGCTGCTGCAAATTT
       pTCGACGACGTTTAAA

cDNAs as described above. 

A flo>v chamber (500) d.agrammauca.ly "P-"* ■ 5 " ^ 
etching a cavity havmg a « »>« <S02) and out*. (504) in a glass ^ (506) 

m ic— g e g. Efa» « al, .ntema-ona, paten, apptcau* 

^ S E9I«K,327 (W091/16966); Brown. U.S. patent 4.9. ..782. Harnson « a. Ana.. 
ST « Z..9T2 ,.992); and the hhc. The *— of flow cham to ,5C* arc such 
loaded tnicroparticlc, CM), e.g. GMA Kads, may be dtsposcd ,. cavrty (5 .0, .» a 

pi- — of .00-200 ^s. Ca»,,y (5 ,0, * — 

etched glass plate (506), e.g. Pomerantz, U.S. patent 3,397,27V. K g 

. /^lAtKmnoh 5201 through valve block (522) 
the flow chamber from syringe pumps (514 through WU) tnro h 



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controlled by a microprocessor as is commonly used on automated DNA and peptide 
synthesizers, e.g. Bridgham et al, U.S. patent 4,668,479; Hood et al, U.S. patent 4,252,769;
Barstow et al, U.S. patent 5,203,368; Hunkapiller, U.S. patent 4,703,913; or the like.

Three cycles of ligation, identification, and cleavage are carried out in flow chamber
(500) to give the sequences of 12 nucleotides at the termini of each of approximately 100,000
cDNAs. Nucleotides of the cDNAs are identified by hybridizing tag complements to the
encoded adaptors as described in Example 1. Specifically hybridized tag complements are
detected by exciting their fluorescent labels with illumination beam (524) from light source
(526), which may be a laser, mercury arc lamp, or the like. Illumination beam (524) passes
through filter (528) and excites the fluorescent labels on tag complements specifically
hybridized to encoded adaptors in flow chamber (500). Resulting fluorescence (530) is
collected by confocal microscope (532), passed through filter (534), and directed to CCD
camera (536), which creates an electronic image of the bead array for processing and analysis
by workstation (538). Preferably, after each ligation and cleavage step, the cDNAs are
treated with Pronase™ or like enzyme. Encoded adaptors and T4 DNA ligase (Promega,
Madison, WI) at about 0.75 units per uL are passed through the flow chamber at a flow rate
of about 1-2 uL per minute for about 20-30 minutes at 16°C, after which 3' phosphates are
removed from the adaptors and the cDNAs prepared for second strand ligation by passing a
mixture of alkaline phosphatase (New England Bioscience, Beverly, MA) at 0.02 units per uL
and T4 DNA kinase (New England Bioscience, Beverly, MA) at 7 units per uL through the
flow chamber at 37°C with a flow rate of 1-2 uL per minute for 15-20 minutes. Ligation is
accomplished by passing T4 DNA ligase (0.75 units per uL, Promega) through the flow
chamber for 20-30 minutes. Tag complements at 25 nM concentration are passed through the
flow chamber at a flow rate of 1-2 uL per minute for 10 minutes at 20°C, after which
fluorescent labels carried by the tag complements are illuminated and fluorescence is
collected. The tag complements are melted from the encoded adaptors by passing
hybridization buffer through the flow chamber at a flow rate of 1-2 uL per minute at 55°C for
10 minutes. Encoded adaptors are cleaved from the cDNAs by passing Bbv I (New England
Biosciences, Beverly, MA) at 1 unit/uL at a flow rate of 1-2 uL per minute for 20 minutes at
37°C.
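
The step durations given above can be tallied to estimate the reagent-flow time of one cycle. The sketch below is a rough bookkeeping aid only: the step labels are paraphrased, a temperature of `None` marks a step for which the text gives none, and the hybridization and melting steps are counted once although they would be repeated for each tag complement applied.

```python
# (step, minimum minutes, maximum minutes, temperature in deg C or None)
STEPS = [
    ("ligation of encoded adaptors",      20, 30, 16),
    ("phosphatase and kinase treatment",  15, 20, 37),
    ("second strand ligation",            20, 30, None),
    ("tag complement hybridization",      10, 10, 20),
    ("melting of tag complements",        10, 10, 55),
    ("Bbv I cleavage",                    20, 20, 37),
]

def cycle_minutes(steps):
    """Return (minimum, maximum) total flow time in minutes for one cycle."""
    low = sum(step[1] for step in steps)
    high = sum(step[2] for step in steps)
    return low, high

low, high = cycle_minutes(STEPS)
print(f"one cycle: {low}-{high} minutes of reagent flow")
```

Washes, imaging, and protease treatment are not included, so actual cycle times would be longer.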
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APPENDIX I 

Exemplary computer program for generating

minimally cross hybridizing sets
(single stranded tag/single stranded tag complement) 



      Program tagN
c
c     Program tagN generates minimally cross-hybridizing
c     sets of subunits given i) N--subunit length, and ii)
c     an initial subunit sequence. tagN assumes that only
c     3 of the four natural nucleotides are used in the tags.
c
      character*1 sub1(20)
      integer*2 mset(10000,20), nbase(20)
c
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
100   format(i2)
c
c
      write(*,*)'ENTER SUBUNIT SEQUENCE'
      read(*,110)(sub1(k),k=1,nsub)
110   format(20a1)
c
c     Minimum number of mismatches required between words
c
      ndiff=10
c
c     Let a=1 c=2 g=3 & t=4
c
      do 800 kk=1,nsub
      if(sub1(kk).eq.'a') then
         mset(1,kk)=1
      endif
      if(sub1(kk).eq.'c') then
         mset(1,kk)=2
      endif
      if(sub1(kk).eq.'g') then
         mset(1,kk)=3
      endif
      if(sub1(kk).eq.'t') then
         mset(1,kk)=4
      endif
800   continue
c
c     jj counts the words accepted so far (word 1 is sub1)
c
      jj=1
c
c     Enumerate every candidate word over three nucleotides
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
      do 1000 k5=1,3
      do 1000 k6=1,3
      do 1000 k7=1,3
      do 1000 k8=1,3
      do 1000 k9=1,3
      do 1000 k10=1,3
      do 1000 k11=1,3
      do 1000 k12=1,3
      do 1000 k13=1,3
      do 1000 k14=1,3
      do 1000 k15=1,3
      do 1000 k16=1,3
      do 1000 k17=1,3
      do 1000 k18=1,3
      do 1000 k19=1,3
      do 1000 k20=1,3
c
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
      nbase(5)=k5
      nbase(6)=k6
      nbase(7)=k7
      nbase(8)=k8
      nbase(9)=k9
      nbase(10)=k10
      nbase(11)=k11
      nbase(12)=k12
      nbase(13)=k13
      nbase(14)=k14
      nbase(15)=k15
      nbase(16)=k16
      nbase(17)=k17
      nbase(18)=k18
      nbase(19)=k19
      nbase(20)=k20
c
c     Compare the candidate with every word already accepted
c
      do 1250 nn=1,jj
c
      n=0
      do 1200 j=1,nsub
      if(mset(nn,j).eq.1 .and. nbase(j).ne.1 .or.
     1   mset(nn,j).eq.2 .and. nbase(j).ne.2 .or.
     2   mset(nn,j).eq.3 .and. nbase(j).ne.3 .or.
     3   mset(nn,j).eq.4 .and. nbase(j).ne.4) then
         n=n+1
      endif
1200  continue
c
c     Reject the candidate if it is too close to word nn
c
      if(n.lt.ndiff) then
         goto 1000
      endif
1250  continue
c
c     Candidate accepted: add it to the set
c
      jj=jj+1
      write(*,130)(nbase(i),i=1,nsub),jj
      do 1100 i=1,nsub
         mset(jj,i)=nbase(i)
1100  continue
c
c
1000  continue
c
c
      write(*,*)
130   format(10x,20(1x,i1),5x,i5)
      write(*,*)
      write(*,120) jj
120   format(1x,'Number of words=',i5)
c
c
      end
c
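
The search strategy of program tagN is a greedy one: candidate words are enumerated in a fixed order, and a candidate is added to the set only if it differs from every word already accepted in at least ndiff positions. The following Python sketch restates that strategy in compact form for small word lengths. It is an illustration under stated assumptions, not a transcription of the Fortran listing; the function name `greedy_min_cross_hyb` and its parameters are hypothetical.

```python
from itertools import product

def greedy_min_cross_hyb(seed, alphabet, ndiff):
    """Greedily build a set of words, starting from `seed`, in which every
    pair differs in at least `ndiff` positions.

    Candidates are visited in lexicographic order of `alphabet`, with the
    first position varying slowest, as in nested DO loops.
    """
    chosen = [seed]
    for cand in product(alphabet, repeat=len(seed)):
        word = "".join(cand)
        # Accept only if far enough from every word accepted so far.
        if all(sum(a != b for a, b in zip(word, w)) >= ndiff for w in chosen):
            chosen.append(word)
    return chosen

words = greedy_min_cross_hyb(seed="aaaa", alphabet="acg", ndiff=3)
for w in words:
    print(w)
print("Number of words =", len(words))
```

The enumeration visits every possible word, so run time grows exponentially with word length; the sketch is practical only for short words such as the four-nucleotide example shown.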
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APPENDIX II 

Exemplary computer program for generating 
minimally cross hybridizing sets 
(double stranded tag/single stranded tag complement) 

      Program 3tagN
c
c     Program 3tagN generates minimally cross-hybridizing
c     sets of triplex words given i) N--subunit length, (ii)
c     an initial subunit sequence, and iii) the identity
c     of the nucleotides making up the subunits, i.e.
c     whether the subunits consist of all four
c     nucleotides,
c     or some subset of nucleotides.
c
      character*1 sub1(20)
      integer*2 mset(10000,20), nbase(20)
c
c
      nsub=20
      ndiff=6
c
c
      write(*,*)'ENTER SUBUNIT SEQUENCE: a & g only'
      read(*,110)(sub1(k),k=1,nsub)
110   format(20a1)
c
c     Generate set of words differing from
c     sub1 by at least ndiff nucleotides.
c
c     Translate a's & g's into numbers with a=1 & g=2
c
      do 800 kk=1,nsub
      if(sub1(kk).eq.'a') then
         mset(1,kk)=1
      endif
      if(sub1(kk).eq.'g') then
         mset(1,kk)=2
      endif
800   continue
c
c     jj counts the words accepted so far (word 1 is sub1)
c
      jj=1
c
c
      do 1000 k1=1,2
      do 1000 k2=1,2
      do 1000 k3=1,2
      do 1000 k4=1,2
      do 1000 k5=1,2
      do 1000 k6=1,2
      do 1000 k7=1,2
      do 1000 k8=1,2
      do 1000 k9=1,2
      do 1000 k10=1,2
      do 1000 k11=1,2
      do 1000 k12=1,2
      do 1000 k13=1,2
      do 1000 k14=1,2
      do 1000 k15=1,2
      do 1000 k16=1,2
      do 1000 k17=1,2
      do 1000 k18=1,2
      do 1000 k19=1,2
      do 1000 k20=1,2
c
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
      nbase(5)=k5
      nbase(6)=k6
      nbase(7)=k7
      nbase(8)=k8
      nbase(9)=k9
      nbase(10)=k10
      nbase(11)=k11
      nbase(12)=k12
      nbase(13)=k13
      nbase(14)=k14
      nbase(15)=k15
      nbase(16)=k16
      nbase(17)=k17
      nbase(18)=k18
      nbase(19)=k19
      nbase(20)=k20
c
c     Compare the candidate with every word already accepted
c
      do 1250 nn=1,jj
c
      n=0
      do 1200 j=1,nsub
      if(mset(nn,j).eq.1 .and. nbase(j).ne.1 .or.
     1   mset(nn,j).eq.2 .and. nbase(j).ne.2) then
         n=n+1
      endif
1200  continue
c
c     Reject the candidate if it is too close to word nn
c
      if(n.lt.ndiff) then
         goto 1000
      endif
1250  continue
c
c     Candidate accepted: add it to the set
c
      jj=jj+1
      write(*,130)(nbase(i),i=1,nsub),jj
      do 1100 i=1,nsub
         mset(jj,i)=nbase(i)
1100  continue
c
c
1000  continue
c
c
      write(*,*)
130   format(5x,20(1x,i1),5x,i5)
      write(*,*)
      write(*,120) jj
120   format(1x,'Number of words=',i5)
c
c
      end
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SEQUENCE LISTING 

(1) GENERAL INFORMATION: 

(i) APPLICANT: Lynx Therapeutics, Inc. 

(ii) TITLE OF INVENTION: Massively Parallel Signature Sequencing by
Ligation of Encoded Adaptors

(iii) NUMBER OF SEQUENCES : 28 



(iv) CORRESPONDENCE ADDRESS:



(A) ADDRESSEE: Dehlinger & Associates 

(B) STREET: 350 Cambridge Avenue, Suite 250 

(C) CITY: Palo Alto 

(D) STATE: CA 

(E) COUNTRY: USA 

(F) ZIP: 94306 

(v) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: Floppy disk 

(B) COMPUTER: IBM PC compatible 

(C) OPERATING SYSTEM: PC-DOS/MS-DOS

(D) SOFTWARE: PatentIn Release #1.0, Version #1.25

(vi) CURRENT APPLICATION DATA:

(A) APPLICATION NUMBER:

(B) FILING DATE: 

(vii) PRIOR APPLICATION DATA: 

(A) APPLICATION NUMBER: US 08/689,587 

(B) FILING DATE: 12-AUG-96 

(vii) PRIOR APPLICATION DATA: 

(A) APPLICATION NUMBER: US 08/659,453 

(B) FILING DATE: 06-JUN-96 

(viii) ATTORNEY/ AGENT INFORMATION: 

(A) NAME: Powers, Vincent M. 

(B) REGISTRATION NUMBER: 36,246 

(C) REFERENCE/DOCKET NUMBER: 5525-0029.41/808-1wo

(ix) TELECOMMUNICATION INFORMATION: 

(A) TELEPHONE: (415) 324-0880 

(B) TELEFAX: (415) 324-0960

(2) INFORMATION FOR SEQ ID NO: 1:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 28 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1:
TGGATTCTAG AGAGAGAGAG AGAGAGAG

(2) INFORMATION FOR SEQ ID NO: 2:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 16 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2:
NNGGATGNNN NNNNNN

(2) INFORMATION FOR SEQ ID NO: 3:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 11 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3:
NRRGATCYNN N

(2) INFORMATION FOR SEQ ID NO: 4:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4:
AGTGGCTGGG CATCGGACCG

(2) INFORMATION FOR SEQ ID NO: 5:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 20 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5:
GGGGCCCAGT CAGCGTCGAT
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(2) INFORMATION FOR SEQ ID NO: 6: 

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 70 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6:
AAAAGGAGGA GGCCTTGATA GAGAGGACCT GTTTAAACGT
TTAAACGGAT CCTCTTCCTC TTCCTCTTCC



(2) INFORMATION FOR SEQ ID NO: 7: 

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 34 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 7:
CTAAACCATT GGTATGGGCC AGTGAATTGT AATA

(2) INFORMATION FOR SEQ ID NO: 8:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 55 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 8:
CGCGCAGCCC GCATCGTTTA TGCTACAGAC TGTCAGTGCA
GCTCTCCGAT CCAAA



(2) INFORMATION FOR SEQ ID NO: 9: 

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 16 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 9:
TCGATGGGCT AGATTT
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(2) INFORMATION TOR SEQ ID NO: 10: 

(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 10:
ANNNTACAGC TGCATCCCTT GGCGCTGAGG

(2) INFORMATION FOR SEQ ID NO: 11:

(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 11:
NANNTACAGC TGCATCCCTG GGCCTGTAAG

(2) INFORMATION FOR SEQ ID NO: 12:

(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 12:
CNNNTACAGC TGCATCCCTT GACGGGTCTC

(2) INFORMATION FOR SEQ ID NO: 13:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 30 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 13:
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NCNNTACAGC TGCATCCCTG CCCGCACAGT 



(2) INFORMATION FOR SEQ ID NO: 14: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 30 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 14: 



GNNNTACAGC TGCATCCCTT CGCCTCGGAC 



(2) INFORMATION FOR SEQ ID NO: 15: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 30 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 15: 



NGNNTACAGC TGCATCCCTG ATCCGCTAGC 



(2) INFORMATION FOR SEQ ID NO: 16: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 30 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 16: 



TNNNTACAGC TGCATCCCTT CCGAACCCGC 
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(2) INFORMATION FOR SEQ ID NO: 17: 

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 30 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 17:
NTNNTACAGC TGCATCCCTG AGGGGGATAG

(2) INFORMATION FOR SEQ ID NO: 18:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 30 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 18:
NNANTACAGC TGCATCCCTT CCCGCTACAC

(2) INFORMATION FOR SEQ ID NO: 19:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 30 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 19:
NNNATACAGC TGCATCCCTG ACTCCCCGAG

(2) INFORMATION FOR SEQ ID NO: 20:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 30 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 20:
NNCNTACAGC TGCATCCCTG TGTTGCGCGG
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(2) INFORMATION FOR SEQ ID NO: 21: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 30 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS : double 

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 21:
NNNCTACAGC TGCATCCCTC TACAGCAGCG

(2) INFORMATION FOR SEQ ID NO: 22:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 30 nucleotides
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 22:
NNGNTACAGC TGCATCCCTG TCGCGTCGTT

(2) INFORMATION FOR SEQ ID NO: 23:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 30 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 23: 

NNNGTACAGC TGCATCCCTC GGAGCAACCT 
(2) INFORMATION FOR SEQ ID NO: 24: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 30 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 24: 
NNTNTACAGC TGCATCCCTG GTGACCGTAG 
(2) INFORMATION FOR SEQ ID NO: 25:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 30 nucleotides

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 25:
NNNTTACAGC TGCATCCCTC CCCTGTCGGA
(2) INFORMATION FOR SEQ ID NO: 26:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 78 nucleotides

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 26:

ACTAATCGTC TCACTATTTA ATTAANNNNN NNNNNNNNNN 
NNNNNNNNNN NNNNNNNGGT TTTTTTTTTT TTTTTTTV 

(2) INFORMATION FOR SEQ ID NO: 27:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 17 nucleotides

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 27:
ATAGGGGTCT TCGGTAC 

(2) INFORMATION FOR SEQ ID NO: 28: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 19 nucleotides

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 28: 
GATCAGCTGC TGCAAATTT 
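
The adaptor sequences of SEQ ID NO: 16 to 25 each begin with four positions of which three are the degenerate symbol N and one is a defined base. The following is a minimal sketch, not part of the specification, that reads off which position and base each listed adaptor carries. It assumes that the first four symbols correspond to the protruding strand; the names `LISTED`, `ADAPTORS` and `interrogated_base` are hypothetical.

```python
# Top strands copied from SEQ ID NO: 16 to 25 (grouping spaces are stripped below).
LISTED = {
    16: "TNNNTACAGC TGCATCCCTT CCGAACCCGC",
    17: "NTNNTACAGC TGCATCCCTG AGGGGGATAG",
    18: "NNANTACAGC TGCATCCCTT CCCGCTACAC",
    19: "NNNATACAGC TGCATCCCTG ACTCCCCGAG",
    20: "NNCNTACAGC TGCATCCCTG TGTTGCGCGG",
    21: "NNNCTACAGC TGCATCCCTC TACAGCAGCG",
    22: "NNGNTACAGC TGCATCCCTG TCGCGTCGTT",
    23: "NNNGTACAGC TGCATCCCTC GGAGCAACCT",
    24: "NNTNTACAGC TGCATCCCTG GTGACCGTAG",
    25: "NNNTTACAGC TGCATCCCTC CCCTGTCGGA",
}
ADAPTORS = {seq_id: seq.replace(" ", "") for seq_id, seq in LISTED.items()}


def interrogated_base(seq, overhang_len=4):
    """Return (position, base) of the single defined base in the leading positions.

    Assumption: the first `overhang_len` symbols are the protruding strand,
    and 'N' stands for any nucleotide at that position.
    """
    defined = [(i, b) for i, b in enumerate(seq[:overhang_len], start=1) if b != "N"]
    if len(defined) != 1:
        raise ValueError("expected exactly one defined base in the leading positions")
    return defined[0]


for seq_id, seq in sorted(ADAPTORS.items()):
    position, base = interrogated_base(seq)
    print(f"SEQ ID NO: {seq_id}: position {position}, base {base}, length {len(seq)}")
```

Each listed strand is 30 symbols long once the grouping spaces are removed, which agrees with the LENGTH field of the corresponding entries.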
We claim:

1. A method of determining the nucleotide sequence at an end of a polynucleotide, the
method comprising the steps of:

ligating one or more encoded adaptors to an end of the polynucleotide, each encoded
adaptor having an oligonucleotide tag selected from a minimally cross-hybridizing set of
oligonucleotides and a protruding strand complementary to a portion of a strand of the
polynucleotide; and

identifying one or more nucleotides in each of said portions of the strand of the
polynucleotide by specifically hybridizing a tag complement to each oligonucleotide tag of the
one or more encoded adaptors ligated thereto.

2. The method of claim 1 wherein said step of ligating includes ligating a plurality of
different encoded adaptors to said end of said polynucleotide such that said protruding strands
of the plurality of different encoded adaptors are complementary to a plurality of different
portions of said strand of said polynucleotide such that there is a one-to-one correspondence
between said different encoded adaptors and the different portions of said strand.



3. The method of claim 2 wherein said different portions of said strand of said
polynucleotide are contiguous.

4. The method of any of claims 1 to 3 wherein said protruding strand of said encoded
adaptors contains from 2 to 6 nucleotides and wherein said step of identifying includes
specifically hybridizing said tag complements to said oligonucleotide tags such that the
identity of each nucleotide in said portions of said polynucleotide is determined successively.

5. The method of any of claims 1 to 4 wherein said step of identifying further includes
providing a number of sets of tag complements equivalent to the number of nucleotides to be
identified in said portions of said polynucleotide.

6. The method of claim 5 wherein said step of identifying further includes providing said
tag complements in each of said sets that are capable of indicating the presence of a
predetermined nucleotide by a signal generated by a fluorescent signal generating moiety,
there being a different fluorescent signal generating moiety for each kind of nucleotide.
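
As an illustration of claims 5 and 6, the sketch below calls one base per set of tag complements by taking the brightest of four fluorescent labels. It is only a sketch under stated assumptions: the dye names, the intensity values and the function `call_bases` are hypothetical, and the claims require only that each kind of nucleotide has a different fluorescent signal generating moiety.

```python
# Hypothetical dye names; one distinct label per kind of nucleotide.
DYE_TO_BASE = {"dye1": "A", "dye2": "C", "dye3": "G", "dye4": "T"}


def call_bases(signals_per_set):
    """Call one base per set of tag complements from the brightest dye.

    `signals_per_set` holds one dict per nucleotide position to be identified,
    mapping a dye name to a measured intensity (illustrative values only).
    """
    called = []
    for signals in signals_per_set:
        brightest = max(signals, key=signals.get)
        called.append(DYE_TO_BASE[brightest])
    return "".join(called)


# Two positions to identify, hence two sets of tag complements.
readout = [
    {"dye1": 0.91, "dye2": 0.03, "dye3": 0.02, "dye4": 0.04},
    {"dye1": 0.05, "dye2": 0.02, "dye3": 0.88, "dye4": 0.05},
]
print(call_bases(readout))
```

The number of dicts in `readout` plays the role of the number of sets of tag complements in claim 5.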

7. The method of any of claims 1 to 6 wherein said oligonucleotide tags of said encoded
adaptors are single stranded and said tag complements to said oligonucleotide tags are single
stranded such that specific hybridization between an oligonucleotide tag and its respective tag
complement occurs through Watson-Crick base pairing.

8. The method of claim 7 wherein said encoded adaptors have the form:

5'-p(N)_n(N)_r(N)_s(N)_q(N)_t-3'
        z(N')_r(N')_s(N')_q-5'

or

        p(N)_r(N)_s(N)_q(N)_t-3'
3'-z(N)_n(N')_r(N')_s(N')_q-5'

where N is a nucleotide and N' is its complement, p is a phosphate group, z is a 3' hydroxyl or
a 3' blocking group, n is an integer between 2 and 6, inclusive, r is an integer between 0 and
18, inclusive, s is an integer which is either between four and six, inclusive, whenever the
encoded adaptor has a nuclease recognition site or is 0 whenever there is no nuclease
recognition site, q is an integer greater than or equal to 0, and t is an integer greater than or
equal to 8.
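
The numerical limits recited for n, r, s, q and t can be checked mechanically. The following is a minimal sketch of such a check, not a definition taken from the specification; the class `AdaptorShape` and the function `within_claimed_ranges` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptorShape:
    """Segment lengths of an encoded adaptor (hypothetical container)."""
    n: int  # protruding strand
    r: int
    s: int  # nuclease recognition site, or 0 if there is none
    q: int
    t: int  # oligonucleotide tag
    has_nuclease_site: bool


def within_claimed_ranges(a: AdaptorShape) -> bool:
    """Check the ranges recited for n, r, s, q and t."""
    if a.has_nuclease_site:
        s_ok = 4 <= a.s <= 6
    else:
        s_ok = a.s == 0
    return (2 <= a.n <= 6) and (0 <= a.r <= 18) and s_ok and a.q >= 0 and a.t >= 8


print(within_claimed_ranges(AdaptorShape(n=4, r=2, s=5, q=3, t=12, has_nuclease_site=True)))
```

The narrower ranges of the dependent claims could be checked the same way by tightening the bounds on r and t.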

9. The method of claim 8 wherein r is between 0 and 12, inclusive, t is an integer 
between 8 and 20, inclusive, and z is a phosphate group. 

10. The method of any of claims 1 to 6 wherein said oligonucleotide tags of said encoded
adaptors are double stranded and said tag complements to said oligonucleotide tags are single
stranded such that specific hybridization between an oligonucleotide tag and its respective tag
complement occurs through the formation of a Hoogsteen or reverse Hoogsteen triplex.

11. The method of claim 10 wherein said encoded adaptors have the form:

5'-p(N)_n(N)_r(N)_s(N)_q(N)_t-3'
        z(N')_r(N')_s(N')_q(N')_t-5'

or

        p(N)_r(N)_s(N)_q(N)_t-3'
3'-z(N)_n(N')_r(N')_s(N')_q(N')_t-5'

where N is a nucleotide and N' is its complement, p is a phosphate group, z is a 3' hydroxyl or
a 3' blocking group, n is an integer between 2 and 6, inclusive, r is an integer between 0 and
18, inclusive, s is an integer which is either between four and six, inclusive, whenever the
encoded adaptor has a nuclease recognition site or is 0 whenever there is no nuclease
recognition site, q is an integer greater than or equal to 0, and t is an integer greater than or
equal to 8.

12. The method of claim 11 wherein r is between 0 and 12, inclusive, t is an integer
between 12 and 24, inclusive, and z is a phosphate group.

13. The method of any of claims 1 to 12 wherein members of said minimally cross-
hybridizing set differ from every other member by at least six nucleotides.
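
The requirement that every member differ from every other member by at least six nucleotides can be sketched as a pairwise comparison. The example below assumes, for illustration only, that all tags have equal length and that "differ by" means position-by-position mismatches (Hamming distance); the tag sequences and function names are hypothetical.

```python
from itertools import combinations


def mismatches(a, b):
    """Count positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length for this comparison")
    return sum(x != y for x, y in zip(a, b))


def differs_by_at_least(tags, min_diff=6):
    """True if every pair of tags differs at `min_diff` or more positions."""
    return all(mismatches(a, b) >= min_diff for a, b in combinations(tags, 2))


# Hypothetical 8-mer tags, not taken from the specification.
good_set = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"]
bad_set = good_set + ["AAAAAACC"]
print(differs_by_at_least(good_set), differs_by_at_least(bad_set))
```

Here `bad_set` fails because its last member differs from the first at only two positions.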

14. [...] the method comprising the steps of:

(a) attaching a first oligonucleotide tag from a repertoire of tags to each
polynucleotide in a population of polynucleotides such that each first oligonucleotide tag from
the repertoire is selected from a first minimally cross-hybridizing set;

(b) sampling the population of polynucleotides to form a sample of polynucleotides
such that substantially all different polynucleotides in the sample have different first
oligonucleotide tags attached;

(c) sorting the polynucleotides of the sample by specifically hybridizing the first
oligonucleotide tags with their respective complements, the respective complements being
attached as uniform populations of substantially identical oligonucleotides in spatially discrete
regions on the one or more solid phase supports;

(d) ligating one or more encoded adaptors to an end of the polynucleotides in the
sample, each encoded adaptor having a second oligonucleotide tag selected from a second
minimally cross-hybridizing set and a protruding strand complementary to a protruding strand
of a polynucleotide of the population; and

(e) identifying a plurality of nucleotides in said protruding strands of the
polynucleotides by specifically hybridizing a tag complement to each second oligonucleotide
tag of the one or more encoded adaptors.

15. The method of claim 14 further including the steps of (f) cleaving said encoded
adaptors from said polynucleotides with a nuclease having a nuclease recognition site separate
from its cleavage site so that a new protruding strand is formed on said end of each of said
polynucleotides, and (g) repeating steps (d) through (f).



16. A method of identifying a population of mRNA molecules, the method comprising the
steps of:
[...] each cDNA [...] being selected from a first minimally cross-hybridizing set;

[...]

(e) determining the identity and ordering of [...]
protruding strands of the cDNA molecules by specifically hybridizing a tag complement to
each second oligonucleotide tag of the one or more encoded adaptors;

wherein the population of mRNA molecules is identified by the frequency distribution
of the portions of sequences of the cDNA molecules.
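
The frequency distribution referred to above can be illustrated by tallying the sequence portions read from each cDNA molecule. This is a minimal sketch with made-up data; the function name `frequency_distribution` and the example signatures are hypothetical.

```python
from collections import Counter


def frequency_distribution(signatures):
    """Return the relative frequency of each sequence portion."""
    counts = Counter(signatures)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sig: count / total for sig, count in counts.items()}


# Made-up sequence portions, one per sorted cDNA molecule.
observed = ["ACGT", "ACGT", "TTGA", "ACGT"]
print(frequency_distribution(observed))
```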

17. The method of claim 16 further including the steps of (f) cleaving said encoded
adaptors from said polynucleotides with a nuclease having a nuclease recognition site separate
from its cleavage site so that a new protruding strand is formed on said end of each of said
cDNA molecules, and (g) repeating steps (d) through (f).

18. A method of determining the nucleotide sequence at an end of a polynucleotide, the
method comprising the steps of:

(a) ligating an encoded adaptor to an end of the polynucleotide, the encoded adaptor
having an oligonucleotide tag selected from a minimally cross-hybridizing set of
oligonucleotides and a protruding strand complementary to a portion of a strand of the
polynucleotide;

(b) identifying one or more nucleotides in the portion of the strand of the
polynucleotide by specifically hybridizing a tag complement to the oligonucleotide tag of the
encoded adaptor ligated thereto;

(c) cleaving the encoded adaptor from the end of the polynucleotide with a nuclease
having a nuclease recognition site separate from its cleavage site so that a new protruding
strand is formed at the end of the polynucleotide; and

(d) repeating steps (a) through (c).
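
The cycle of steps (a) through (d) can be pictured with a toy simulation. The sketch below is not the claimed chemistry: it assumes that each cleavage exposes exactly the next `n` bases as the new protruding strand, models each adaptor as a (position, base) pair in the manner of SEQ ID NO: 16 to 25, and pairs bases position by position without modelling strand orientation. All names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def ligated_tags(overhang):
    """Step (a): adaptors whose defined base pairs with the overhang.

    Each adaptor is represented by the (position, adaptor base) that its tag encodes.
    """
    return {(i, COMPLEMENT[b]) for i, b in enumerate(overhang, start=1)}


def decode(tags, n):
    """Step (b): recover the overhang, one position at a time, from the tags."""
    adaptor_base_at = dict(tags)
    return "".join(COMPLEMENT[adaptor_base_at[i]] for i in range(1, n + 1))


def sequence_end(target, n=4, cycles=3):
    """Repeat ligate, identify, cleave on a toy target string."""
    read = []
    pos = 0
    for _ in range(cycles):
        overhang = target[pos:pos + n]
        if len(overhang) < n:
            break  # not enough sequence left to form a full protruding strand
        read.append(decode(ligated_tags(overhang), n))
        pos += n  # step (c): toy cleavage exposes the next n bases
    return "".join(read)


print(sequence_end("ACGTTGCAAGCT"))
```

With `n=4` and three cycles the toy model reads twelve bases, four per cycle.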
19. The method of claim 18 wherein said protruding strand of said encoded adaptor
contains from 2 to 6 nucleotides and wherein said step of identifying includes specifically
hybridizing successive said tag complements to said oligonucleotide tag such that the identity
of each nucleotide in said portion of said polynucleotide is determined successively.

20. The method of claim 18 or claim 19 wherein said step of identifying further includes 
providing a number of sets of tag complements equivalent to the number of nucleotides to be 
identified in said portion of said polynucleotide. 

21. The method of claim 20 wherein said step of identifying further includes providing
said tag complements in each of said sets that are capable of indicating the presence of a
predetermined nucleotide by a signal generated by a fluorescent signal generating moiety,
there being a different fluorescent signal generating moiety for each kind of nucleotide.

22. The method of any of claims 18 to 21 wherein said oligonucleotide tags of said
encoded adaptors are single stranded and said tag complements to said oligonucleotide tags 
are single stranded such that specific hybridization between an oligonucleotide tag and its 
respective tag complement occurs through Watson-Crick base pairing. 



23. A composition of matter comprising a double stranded oligonucleotide adaptor having
the form:

5'-p(N)_n(N)_r(N)_s(N)_q(N)_t-3'
        z(N')_r(N')_s(N')_q-5'

or

        p(N)_r(N)_s(N)_q(N)_t-3'
3'-z(N)_n(N')_r(N')_s(N')_q-5'

where N is a nucleotide and N' is its complement, p is a phosphate group, z is a 3' hydroxyl or
a 3' blocking group, n is an integer between 2 and 6, inclusive, r is an integer between 0 and
18, inclusive, s is an integer which is either between four and six, inclusive, whenever the
encoded adaptor has a nuclease recognition site or is 0 whenever there is no nuclease
recognition site, q is an integer greater than or equal to 0, t is an integer greater than or equal
[...]
24. The composition of claim 23 wherein r is between 0 and 12, inclusive, t is an integer
between 8 and 20, inclusive, z is a phosphate group, and said single stranded moiety (N)_t is a
member of a minimally cross-hybridizing set.

25. A composition of matter comprising a double stranded oligonucleotide adaptor having
the form:

5'-p(N)_n(N)_r(N)_s(N)_q(N)_t-3'
        z(N')_r(N')_s(N')_q(N')_t-5'

or

        p(N)_r(N)_s(N)_q(N)_t-3'
3'-z(N)_n(N')_r(N')_s(N')_q(N')_t-5'

where N is a nucleotide and N' is its complement, p is a phosphate group, z is a 3' hydroxyl or
a 3' blocking group, n is an integer between 2 and 6, inclusive, r is an integer between 0 and
18, inclusive, s is an integer which is either between four and six, inclusive, whenever the
encoded adaptor has a nuclease recognition site or is 0 whenever there is no nuclease
recognition site, q is an integer greater than or equal to 0, and t is an integer greater than or
equal to 8.

26. The composition of claim 25 wherein r is between 0 and 12, inclusive, t is an integer
between 12 and 24, inclusive, z is a phosphate group, and said double stranded moiety

-(N)_t
-(N')_t

is a member of a minimally cross-hybridizing set.

27. The composition of any of claims 23 to 26 wherein n equals 4 and wherein members
of said minimally cross-hybridizing set differ from every other member by at least six
nucleotides.